ABSTRACT


Data mining can be a valuable tool, particularly in the acquisition of military intelligence. 
As the second study within a larger Naval Postgraduate School research project using 
Amazon Web Services (AWS), this thesis focuses on data mining on a very large data 
set (32 TB) with the open web crawler data set Common Crawl. Similar to previous 
studies, this research employs MapReduce (MR) for sorting and categorizing output 
value pairs. Our research, however, is the first to implement the basic Reverse Web-Link 
Graph (RWLG) algorithm as a search capability for websites, with validation that it 
works correctly. A second goal is to extend the RWLG algorithm using a full Common 
Crawl archive as input for processing as a single MR job. To mitigate the out-of-memory
error, we relate some environment variables with the Yet Another Resource
Negotiator (YARN) architecture and provide some sample error tracking methods. As a 
further contribution, this study considers limitations associated with using AWS, which 
inform our recommendations for future work. 
CHAPTER 1:
Introduction 


Data mining is an interesting field of study as well as a popular tool for many organizations, 
and the military is no exception. The fact that the military values intelligence acquisition as 
an extremely important asset heightens the significance of data mining even more. Thanks to 
the growth of the Internet and the use of smart phones, military activities, for example, can be 
easily observed, recorded, or photographed by military personnel, their family and friends, 
and others. This information is easily posted on any social media site (e.g., Facebook, 
Twitter) and can be handily gathered and analyzed by anyone. As a result, the movement of 
troops is almost impossible to hide nowadays. Indeed, the amount of information released 
through social media every day is so enormous that no one individual can extract specific 
useful information on his own. For example, the average 350 million photos uploaded 
to Facebook each day [1] are nearly impossible to analyze and extract useful information 
from in a reasonable time. The photos themselves are already a tremendous source of 
information, not to mention videos and text posts. This is where data mining comes into 
play. 

This is the second study within a larger Naval Postgraduate School (NPS) research project 
involving data mining with the open web crawler data set Common Crawl. The first study 
within the NPS data mining project was published in June 2016 by A. Coudray [2]. The topic 
for the present thesis was prompted by the Reverse Web-Link Graph (RWLG) algorithm 
in a paper written by Google scientists [3]. As a part of the larger NPS research project, 
the study described in this thesis shares the same services and tools, including Amazon 
Web Services (AWS), which offers various on-line computing services and management 
tools. Like the previous work [2], we use the Common Crawl data set as the input source for 
processing. We also use AWS Simple Storage Service (S3) for result storage and AWS Elastic 
MapReduce (EMR) and Elastic Compute Cloud (EC2) for data mining specific software 
and hardware. Essential background information to understand the rest of the chapters is 
provided within this work. Further details on Big Data, the Hadoop File System (HDFS), 
Hadoop MapReduce (HMR), AWS, and the Common Crawl can be found in [2]. 
1.1 Research Problem

The focus of this thesis is on processing an extremely large data set using the RWLG 
algorithm. In this study, RWLG is used to process the web pages as input, yielding 
ordered Uniform Resource Locator (URL) value pairs < targetURL, sourceURL > which 
constitute a map showing how web pages link to one another. To illustrate the use of RWLG,
suppose we have only someone’s e-mail address and want to find out more about that person.
We can use the e-mail address as a key to find the corresponding links in the RWLG
output-value pairs, and then examine those links to determine the person’s job, company,
community, or profession, yielding a description of the person of interest. We can find
still more information about this person by using the corresponding links as keys to find
even more links.

As previously described, this research uses the RWLG algorithm, which extracts value 
pairs for one-hop links. We not only implement the RWLG algorithm but also extend the 
algorithm to extract link pairs for links of two or more hops under the MapReduce (MR) 
environment. In addition, this work considers the impact of invalid target URLs along with 
dynamic URLs, mitigates some problems encountered in previous works, and results in 
software capable of producing RWLGs on the scale of the Internet. 
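
The key lookup and the multi-hop extension described above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not the thesis
implementation: the pairs, the e-mail key, and the function names are hypothetical, and the
whole graph is held in memory instead of being processed as an MR job.

```python
from collections import defaultdict

# Hypothetical one-hop RWLG output: (targetURL, sourceURL) pairs.
PAIRS = [
    ("mailto:jdoe@example.org", "http://example.org/staff"),
    ("http://example.org/staff", "http://example.org/about"),
    ("http://example.org/staff", "http://example.net/partners"),
    ("http://example.org/about", "http://example.com/news"),
]

def build_index(pairs):
    """Group source URLs by the target they link to."""
    index = defaultdict(set)
    for target, source in pairs:
        index[target].add(source)
    return index

def sources_within(index, key, hops):
    """Map each hop number to the pages first reached at that distance from `key`."""
    seen = {key}
    frontier = {key}
    result = {}
    for hop in range(1, hops + 1):
        next_frontier = set()
        for url in frontier:
            next_frontier |= index.get(url, set())
        next_frontier -= seen          # skip pages already reported at a closer hop
        if not next_frontier:
            break
        result[hop] = next_frontier
        seen |= next_frontier
        frontier = next_frontier
    return result

index = build_index(PAIRS)
found = sources_within(index, "mailto:jdoe@example.org", 2)
```

Here `found[1]` holds the pages that mention the key directly, and `found[2]` holds the pages
that link to those pages, which corresponds to the two-hop case.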

In this work, the emphasis is on dealing with an extremely large data set. As mentioned 
in the military activity observation example, various kinds of information can be extracted 
from the World Wide Web (WWW). The Common Crawl compiled its latest archive of 
all the accessible text portion of the Web pages into 57 TB of compressed files as of the 
first quarter of 2017. Certainly, data mining is used almost everywhere nowadays, but only 
a few applications are required to process this extreme amount of data. Coudray’s thesis 
was a good starting point for this research project, but he encountered many issues while 
processing data that did not even approach one percent of the archive. This study tries 
different approaches to deal with a larger amount of data. 

1.2 Organization 

The literature review and background information necessary to understand this research are 
presented in Chapter 2. The tools used in the research, a guide to the environment setup, the 
algorithms developed for this work, and proofs showing the validity of these algorithms are
explained in Chapter 3. The results of this research are exhibited in Chapter 4. Finally, the 
work is summarized and possible avenues for future work are recommended in Chapter 5. 
CHAPTER 2:
Background 


To work with data mining and tap its efficiency, we need specific hardware and software. 
For hardware, a cluster of computers is preferred, and we can either set up our own cluster, 
or nowadays, rent one from a cloud services company. For software, we need a tool 
appropriate for cluster computing and parallel processing that is suitable for data mining. 
HMR, the MR module of the Apache Hadoop framework, is one of the typical tools for
cluster computing and distributed processing. For this research, we used both hardware and 
software services provided by AWS, so we did not have to set up or maintain hardware or 
install software. The only effort required was to learn how to use their services, which is 
relatively straightforward. Besides hardware and software, we also needed a proper data 
set for data mining processing. The Common Crawl is a suitable data set on which to do 
different kinds of processing. 

Since this study is part of a larger project, we share almost the same background with [2]. 
As the first work in the NPS research project, the Coudray thesis described Big Data and the
HMR infrastructure in detail from a very basic foundation [2]. This study, the
second work in the research project, provides only background sufficient to understand this
thesis. Although HMR is described in [2], one of the key elements in MR version 2, Yet
Another Resource Negotiator (YARN) [4], was not, so it is explained here. As the internal
resource manager in HMR, YARN was very important in this study and is described with
some MR characteristics. Spark, which was not used in this study, is usually compared 
with MR and, therefore, is also explained here. AWS, as the platform to execute our code, 
is discussed, focusing on its changes from the time when the Coudray thesis was published. 
The Common Crawl, as the source data set, is also discussed and compared with its usage 
in Coudray’s work [2]. URLs, which play an important role in the RWLG algorithm in this 
thesis, are yet another topic discussed in this chapter. Finally, we end this chapter with a 
comparison of this study to previous research. 
2.1 MR

As stated in Chapter 1, the research idea stems from Google’s MR paper [3]. MR is a module 
in the Apache Hadoop framework. To work with MR requires thinking about processing 
data in a special way: make everything into value pairs that can be easily processed in 
parallel. MR gets its name from the symbolic two sequential stages: Map and Reduce. 
Mappers are the distinct processes that run the same algorithm in the Map stage, as are 
the reducers in the Reduce stage. The source data set is split into blocks that are
processed by different mappers, each of which produces output as < key, value > pairs. These
pairs are transferred to reducers by key through a step called shuffle and sort. Each
reducer then handles the pairs for a single key and produces the final value pairs for that
key. MR is a simple tool that is easy to learn, write, and deploy, but it has also proved
to be inefficient for many kinds of applications; hence, the next generation, Spark, was
invented [5].
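
The three steps described above (Map, shuffle and sort, Reduce) can be imitated on a single
machine. The following sketch is only a conceptual model written in plain Python, not Hadoop
code; the function names and the two sample pages are hypothetical. It uses the reverse
web-link idea as the example job, so the mapper emits < target, source > pairs.

```python
from collections import defaultdict

def map_page(source_url, target_urls):
    """Map stage: emit one <target, source> pair per outgoing link."""
    for target in target_urls:
        yield target, source_url

def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_links(target, sources):
    """Reduce stage: produce the final value for a single key."""
    return target, sorted(set(sources))

# Hypothetical input: each page with the links found in it.
pages = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
}

mapped = [pair for src, links in pages.items() for pair in map_page(src, links)]
reduced = [reduce_links(key, values) for key, values in shuffle_and_sort(mapped)]
```

In a real cluster, the mappers and reducers run as separate processes on different nodes,
and the grouping step moves data across the network; the sketch only shows the data flow.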

2.1.1 YARN in MR Version 2 

HMR is capable of doing parallel computing on clusters, and YARN is the core element in MR
version 2 that manages resources across the cluster nodes. From the description on the
official MapReduce website [6], we note the key parts of YARN: ResourceManager,
ApplicationMaster, NodeManager, and Container. The YARN architecture is illustrated in
Figure 2.1.

2.1.2 YARN Architecture 

We can see that one node runs the ResourceManager, and other nodes have a NodeManager
per node. ApplicationMasters and Containers reside in the cluster nodes, and the
ApplicationMaster is in charge of its Containers. The mappers and reducers we are familiar
with run in these Containers. Understanding the YARN architecture and knowing the relation
between its parts is critical when setting the MR environment variables. The amount of
memory in a Container limits the number of mappers and reducers that can be accommodated.
The amount of memory in a node limits the number and size of the Containers that
can be accommodated. Without this fundamental knowledge, one gets lost when setting up
the memory parameters.
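
The relation between node memory and Container memory can be made concrete with a little
arithmetic. The sketch below is illustrative only: the node and Container sizes are
hypothetical, it assumes each mapper runs in its own equally sized Container, and the 0.8
heap fraction is a commonly used convention rather than a value taken from this study. The
comments name the Hadoop properties that these numbers would normally be written to.

```python
def containers_per_node(node_memory_mb, container_memory_mb):
    """How many equally sized Containers fit into one node's YARN memory."""
    if container_memory_mb <= 0:
        raise ValueError("container size must be positive")
    return node_memory_mb // container_memory_mb

def heap_for_container(container_memory_mb, heap_fraction=0.8):
    """JVM heap to request inside a Container, leaving headroom for non-heap memory."""
    return int(container_memory_mb * heap_fraction)

# Hypothetical node: 12 GB handed to YARN, 1.5 GB per mapper Container.
node_mb = 12288      # plays the role of yarn.nodemanager.resource.memory-mb
mapper_mb = 1536     # plays the role of mapreduce.map.memory.mb
slots = containers_per_node(node_mb, mapper_mb)
heap_mb = heap_for_container(mapper_mb)   # would go into mapreduce.map.java.opts as -Xmx
```

Raising the Container size to avoid a heap error therefore lowers the number of mappers
that can run on a node at the same time, which is the trade-off discussed in this section.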
Figure 2.1. Description of YARN Architecture. Source: [6].


2.1.3 Memory Parameter Setting 

Other than introducing the MR infrastructure and mechanisms, [7] provides important 
YARN properties and some assumptions regarding memory settings. Since we are using 
EMR, we do not need to set up the YARN architecture, but we do need to configure the
memory settings, a topic that [7] does not discuss. Fortunately, [8] offers a set of
equations for calculating a "rule of thumb" memory setting from the specifications of the
cluster, which builds on an understanding of the YARN architecture. These introductory
resources gave us the initial guidelines to adjust the memory setting and served as a good 
starting point in our research. 


2.2 Spark 

As mentioned earlier, Spark, which is similar to MR, is another tool for cluster computing. 
Spark is known to be more efficient than MR in many types of applications by introducing 
Resilient Distributed Dataset (RDD) [9]; thus, it was an option while starting this research. 

As there is no cache mechanism in MR, when an iterative job such as linear regression is
required, MR must start again from the very beginning in every cycle. Relaunching the whole
process for every task causes poor performance, as shown in Figure 2.2a.



(a) Iterative Job Operation for MR (b) Iterative Job Operation for Spark 

Figure 2.2. Iterative Job Operation Figures for MR and Spark 


We can think of RDD as a cache that saves the mapper’s output. Once the job requires the 
previously computed value pairs, it checks the RDD first before starting the Map stage to 
run the whole data set again and, thus, saves a significant amount of execution time. Spark 
launches the process only once, and every task uses the same computed values from RDD, 
as shown in Figure 2.2b. 

Within this study’s analysis, however, the RWLG is not found by an iterative job; it does 
not require the computed data to be reused before emitting its output, at least not when the 
algorithm is created. It is more likely to have massive sorting, which is the only application 
for which MR is superior to Spark [10], [11]. MR was, therefore, chosen to be the tool for 
this research. 


2.3 AWS 

AWS provides various online services in three different types: Infrastructure as a Service 
(IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). For the services 
of interest in this work, EC2 and S3 belong to IaaS, and EMR belongs to PaaS. There 
are actually two ways to run HMR in AWS: directly use the EMR platform provided or 
to use multiple instances from EC2 and build from the infrastructure. In both [2] and this 
thesis, EMR is used directly because it is not only cheaper but also enables us to reduce the 
complexity of building the infrastructure. Nevertheless, we do not know how Amazon has 
configured the lower levels, which causes problems when conducting our research. Note
that EMR runs on EC2 instances, while EC2 is transparent to EMR users. These two 
services, though, do share the same limitation, and the EMR user still needs to examine the 
EC2 documents when errors occur. 

The AWS environment changed considerably in the nine months between the
publication of [2] and this work. There were few conceptual changes, but many
functions and mechanisms changed through updates and new versions, as is common in a
popular programming environment. General changes to AWS, the working environment,
tools, and pricing information are described separately in the following sections.

2.3.1 General AWS Changes 

Since both works run MR tasks, it makes sense that both works deploy these tasks using 
the EMR service from AWS. The EMR release version has changed from 4.0 to 5.0, which 
includes several changes in functions and mechanisms. The most important difference in 
this study is that the Common Crawl data can no longer be accessed by EMR if we do not
set the appropriate region. In other words, the EMR region has to be set to "US EAST
(Northern Virginia)" now, which is the same region in which the Common Crawl files
reside, for the MapReduce tasks to work correctly.

2.3.2 AWS GovCloud (US) 

Because NPS is a government organization, we can, and are required to, use the government
cloud for research unless none exists, which was the case for the Coudray thesis. This
has no longer been the case since late 2016; thus, the research was moved to the AWS GovCloud
(US). Because of the security requirements of the GovCloud, it is physically separated from 
the normal AWS Cloud. In the current AWS system settings, the separation means the 
GovCloud is unable to access Common Crawl files, which reside in the normal AWS Cloud. 
To solve these issues, we established a special account that belongs to the government and 
has the same functionality as the normal AWS Cloud. 
2.3.3 Command Line Interface

AWS’s Graphical User Interface (GUI) was utilized in Coudray’s work; however, this work
involved many more experiments than Coudray’s. The GUI was impractical for the number
of experiments required since assigning each task through it takes much more time. The
Command Line Interface (CLI) accomplishes the same thing with a single, long command
and can be scripted to start many different tasks quickly. This characteristic greatly
improved the speed of executing experiments.
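
To illustrate what scripting the CLI can look like, the sketch below builds one
`aws emr create-cluster` command per experiment without running anything. It is an
assumption-laden example, not the commands used in this study: the bucket, JAR name, input
and output paths, instance types, and instance count are all hypothetical placeholders, and
the release label simply reflects the EMR 5.0 release mentioned in this chapter.

```python
import shlex

def emr_command(name, jar, main_args, instance_type, instance_count):
    """Build (not run) an `aws emr create-cluster` command for one experiment."""
    step = "Type=CUSTOM_JAR,Name={},Jar={},Args=[{}]".format(
        name, jar, ",".join(main_args))
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", "emr-5.0.0",
        "--region", "us-east-1",          # same region as the Common Crawl files
        "--use-default-roles",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--steps", step,
        "--auto-terminate",               # shut the cluster down when the step ends
    ]

# Hypothetical experiment grid: the same job on two instance types.
JAR = "s3://my-rwlg-bucket/jars/rwlg.jar"          # placeholder location
ARGS = ["s3://my-rwlg-bucket/input/", "s3://my-rwlg-bucket/output/"]
commands = [
    emr_command("rwlg-" + itype.replace(".", "-"), JAR, ARGS, itype, 5)
    for itype in ("c4.xlarge", "m4.xlarge")
]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first makes it easy to review them before passing each list to a
process launcher such as `subprocess.run`.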

2.3.4 How to Choose Instances from AWS 

Instances in AWS are the machines that run our jobs. Performance is always a concern when
running experiments, and AWS offers different kinds of instances with general guidelines
in [12] for users to execute their tasks. We found publications using AWS, but none of them
did a systematic comparison between these instances. There are studies comparing
performance across different numbers of instances [13]-[15], but comparisons between
different types of instances are rare [15], [16]. As a result, it remained
difficult to choose the specific type of instance that suits our research.

There are also works that examine resources and bottlenecks that constrain MR performance
[17], [18]. From these studies, we learned, unexpectedly, that network capability
is not the bottleneck in most MR cases. We also learned that
compute-optimized instances should be chosen for processing serialized/compressed input data
from the Common Crawl, which is exactly our case. This assumption was tested
and verified as discussed in later chapters.

2.3.5 Pricing Information 

AWS earns its profit by providing services, and we, as users, want to get the most out of
the services while spending the least amount of money. There are several considerations,
which we encountered in our own work, when estimating the possible cost of a job before
executing it.
There are three major sources to be considered while estimating the cost:

1. AWS EMR 

When we use EMR to execute our algorithm, we are charged for using it. EMR
provides a "box" for different types of instances, and the boxes are charged by the hour.
The amount charged depends on the types of instances the user is using; the rates are
given by AWS in [19].

2. AWS EC2 

We note that EMR charges only for the box, not the instances. This is because the
instance is charged in the EC2 part, whose price is given by AWS in [20].

3. AWS S3 and Data Transfer 

The input and output for our study are both using S3. The input data set, the Common 
Crawl, is saved on the "US EAST (Northern Virginia)" region of S3, as stated in the 
previous section. The output of our program is saved on the S3 buckets that we create. 
The data saved on the S3 bucket is charged daily based on its size, so we want to
discard the resulting output when it is large (e.g., the result from multiple segments
as input) [21]. Furthermore, data transfer between S3 buckets in different regions is
also charged, as listed in the bottom part of [21]; thus, we want to set
up our bucket and EMR in the same region as the Common Crawl to minimize the
data transfer fee.
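
The three cost sources listed above can be combined into a rough estimate. The sketch below
is a hedged illustration: every rate in it is a placeholder rather than an actual AWS price
(real prices come from [19]-[21]), the function and parameter names are hypothetical, and it
assumes that partial hours are billed as full hours and that storage is prorated over a
30-day month.

```python
import math

def estimate_cost(hours, instances, emr_rate, ec2_rate,
                  output_gb=0.0, storage_days=0, s3_rate_gb_month=0.0,
                  transfer_gb=0.0, transfer_rate_gb=0.0):
    """Rough job cost from the EMR, EC2, and S3/transfer components."""
    billed_hours = math.ceil(hours)  # assumption: partial hours billed as full hours
    compute = billed_hours * instances * (emr_rate + ec2_rate)
    storage = output_gb * s3_rate_gb_month * storage_days / 30.0
    transfer = transfer_gb * transfer_rate_gb
    return {"compute": compute, "storage": storage, "transfer": transfer,
            "total": compute + storage + transfer}

# Placeholder rates for illustration only.
estimate = estimate_cost(hours=2.5, instances=10, emr_rate=0.05, ec2_rate=0.20,
                         output_gb=500, storage_days=3, s3_rate_gb_month=0.023)
```

Keeping the bucket and the cluster in the same region as the input corresponds to leaving
`transfer_gb` at zero, and deleting large outputs early corresponds to a small
`storage_days`.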

2.4 Common Crawl 

The Common Crawl is a repository of periodic archives of the Internet. It uses S3 in AWS 
as storage space and has used a crawl engine to save copies of the Internet periodically 
since 2008. The recent format of the Internet archives is the Web ARChive (WARC) file 
format [22]. Every WARC file contains many WARC records, and each WARC record
contains the HyperText Markup Language (HTML) code of a single web page and some metadata
as a record header from the crawl engine.
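
To make the record structure concrete, the following sketch parses the header block of a
single WARC record from a string. It is a simplified, hand-written illustration and not the
library used for WARC handling in this study; the sample record and the function name are
hypothetical, while `WARC-Type` and `WARC-Target-URI` are standard WARC header fields.

```python
def parse_warc_header(header_text):
    """Parse the header block of one WARC record into (version, fields)."""
    lines = header_text.strip().splitlines()
    version = lines[0].strip()            # first line names the format version
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break                         # a blank line ends the header block
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

# Hypothetical record header, shortened for illustration.
SAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.org/index.html\r\n"
    "Content-Length: 1234\r\n"
    "\r\n"
)
version, fields = parse_warc_header(SAMPLE)
source_url = fields["WARC-Target-URI"]
```

The `WARC-Target-URI` field is what identifies the crawled page, so it is the natural place
to read the URL of the page that a record describes.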

For the current Common Crawl structure, an archive is a full copy of the Internet, which is 
separated into 100 segments, where each segment contains about 578 WARC files. Since 
each WARC file is about 1 GB, an archive is at least 57 TB. Such a large data set is an 
appropriate source data set for data mining and for this work. A view of the Common Crawl 
directory using CLI is shown in Figure 2.3.


    $ aws s3 ls s3://aws-publicdatasets/common-crawl/
                               PRE 2012/
                               PRE blekko/
                               PRE cc-index/
                               PRE crawl-001/
                               PRE crawl-002/
                               PRE crawl-data/
                               PRE crawl-db/
                               PRE crawl-intermediate-1365897600000/
                               PRE finnougric-urls/
                               PRE hive_analysis/
                               PRE index2012/
                               PRE mapred-temp/
                               PRE meanpath/
                               PRE nutch/
                               PRE parse-output-test/
                               PRE parse-output/
                               PRE projects/
                               PRE stats-output/
                               PRE test-job-logs/
                               PRE Wikipedia/
    2013-05-01 13:11:11          0 blekko_$folder$
    2013-07-05 11:27:59          0 crawl-intermediate-1365897600000_$folder$
    2015-12-06 10:14:22          0 hive_analysis_$folder$
    2013-06-27 12:11:43          0 index2012_$folder$
    2013-04-25 21:58:45          0 mapred-temp_$folder$
    2013-10-10 12:38:25          0 meanpath_$folder$
    2013-05-16 05:58:48          0 nutch_$folder$
    2012-09-03 22:03:18          0 parse-output-test_$folder$
    2013-05-14 14:02:43          0 parse-output_$folder$

Figure 2.3. A View of the Common Crawl Directory Using CLI 


2.5 URL 

Understanding URLs is very important to this work since the output from the data mining 
in this research is nothing but URL value pairs. These Internet-related concepts are
defined in documents called Requests for Comments (RFCs).
Before detailing the content of a URL, we should identify the difference between URL,
Uniform Resource Name (URN), Uniform Resource Identifier (URI), and Internationalized
Resource Identifier (IRI), which are documented in RFC1738, RFC3986, and RFC3987
[23]-[25]. These three RFC documents describe the definitions and functionality of
URL, URN, URI, and IRI, which are summarized in the following paragraph.

A URL is a specified resource locator. To access a resource indicated by a URL, one must 
know its host’s name, the directory in which the resource is stored, and the file name of the
specific resource. The URL becomes useless if the corresponding resource is moved, which 
is a problem that resulted in the development of the URN. A URN looks like a URL, but the 
user only needs to know the resource name. By providing information such as a description 
of the resource along with the resource name, we get the most suitable copy of the resource 
from the server. URI is the superset of both URL and URN; that is, every URL is a URI, 
but a URI can be either a URL or a URN. The resource must be represented using the 
American Standard Code for Information Interchange (ASCII) character set, which causes 
a problem if the path or name of the resource contains foreign characters. IRI is intended to 
replace URI using a much wider character repository to “internationalize” it. In this work, 
we only focus on URLs that are already mature and widely used in the World Wide Web as 
well as in the Common Crawl. 

2.5.1 URL Format 

Since the URL is used to locate resources, its format, or scheme, varies depending on the 
type of resources. According to RFC1738, which is the document specifying URLs, we 
find there is a general format using the Backus-Naur form (BNF) representation: 

<scheme>:<scheme-specific-part>

The scheme is the type of applications used in the Internet, such as HTTP, FTP, etc. Since 
HTTP is the most common scheme encountered in this work, its BNF representation is 
described further as in the example: 

http://<host>:<port>/<path>?<searchpart> 

where <host> is the domain name in the Internet layer of the Internet protocol suite, and
<port> is the port number in the Transport layer, which defaults to 80 if omitted.
The <path>, an HTTP selector that represents the file directory on the
machine, and the <searchpart>, which is further described in Section 2.5.2 along with its
preceding question mark, are both optional. If none of the optional parts are used, the
BNF representation can be simplified as follows:

http://<host>


Note that characters like "/" and "?" are reserved in the optional parts, and <searchpart>
plays an important role in dynamic URLs.
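
The parts of the HTTP format above map directly onto what a standard URL parser returns. The
sketch below uses Python's `urllib.parse.urlsplit` to show this; the helper name and the
example URLs are hypothetical, and filling in port 80 when the port is omitted is done
explicitly here because the parser itself reports a missing port as `None`.

```python
from urllib.parse import urlsplit

def describe_http_url(url):
    """Split an HTTP URL into the parts named in the BNF representation."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else 80,  # HTTP default
        "path": parts.path,
        "searchpart": parts.query,     # text after the "?", without the "?"
    }

full = describe_http_url("http://www.example.org:8080/docs/index.html?lang=en")
bare = describe_http_url("http://www.example.org")
```

For the second URL, only the scheme and host are present, which is the simplified
`http://<host>` form.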

2.5.2 Dynamic URL 

A dynamic URL is a URL that contains the <searchpart> part. It is used to display specific 
parts of the content of that specific page, which usually contains a huge amount of data; 
however, the Common Crawl crawls these web pages as different pages, which increases the 
complexity of the resulting RWLG output. A dynamic URL is identified by the existence 
of a "?" in the BNF representation. This study provides the option to simplify the RWLG 
output by discarding the <searchpart> part, which removes the dynamic portion of the URL. 
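
A minimal sketch of this simplification is shown below. It is not the code used in this
study; the function names and sample URLs are hypothetical, and the choice to keep any
fragment untouched is an assumption made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def is_dynamic(url):
    """A URL is treated as dynamic when it contains a "?"."""
    return "?" in url

def strip_searchpart(url):
    """Drop the "?" and the <searchpart>, leaving the static portion of the URL."""
    scheme, netloc, path, _query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", fragment))

# Hypothetical crawl results: three views of the same underlying page.
crawled = [
    "http://example.org/list?page=1",
    "http://example.org/list?page=2&sort=asc",
    "http://example.org/list",
]
simplified = sorted({strip_searchpart(url) for url in crawled})
```

With the option enabled, the three crawled URLs collapse into a single entry, which is how
discarding the <searchpart> reduces the size of the RWLG output.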

2.5.3 User-input-URL in HTML 

There are two values in an output pair: source URL and target URL. Since we are using 
the Common Crawl data as an input source, the source URL is always extracted from the 
WARC record header in the WARC files, which are created by the crawl engine, so it is 
less likely to have problems. The target URL, by contrast, is extracted from the HTML
content created by other users; thus, having an invalid URL as the target URL is possible.
Unfortunately, validating each URL is computationally expensive, so we did not try to
validate URLs in this research. There are two types of invalid URLs. The first type is
syntactically invalid. These errors occur very rarely and are grouped together in
alphabetical order since they usually share the same set of special characters. The second
type of invalid URL is known as link rot [26]. Link rot results in a "dead end" pair, which
means this error does not propagate in our later tasks. It is not our focus to address
either broken links or URL validation, so no action was taken with regard to these invalid
URLs.
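
The difference between the two values in a pair can be seen in a small extraction sketch.
The code below is an illustration under stated assumptions, not the extraction code of this
study: the class and function names, the sample HTML, and the decision to read only the
`href` attribute of `<a>` tags are all hypothetical. In line with the text above, it does
not validate the targets, so a malformed link is emitted unchanged.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, exactly as written by the page author."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)

def link_pairs(source_url, html):
    """Return (targetURL, sourceURL) pairs for one page; no validation is attempted."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [(target, source_url) for target in collector.targets]

# Hypothetical page content with one well-formed and one malformed link.
html = '<p><a href="http://example.org/a">A</a> <a href="htp:/broken link">B</a></p>'
pairs = link_pairs("http://example.com/page", html)
```

The source URL is supplied by the caller, mirroring the fact that it comes from the crawl
engine, while every target comes from user-written HTML and may therefore be invalid.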

2.6 Previous Research 

Many students who use AWS have also employed MR, but no research on RWLG
has been published. As the first step of the NPS research project, Coudray’s
study implemented the Inverted Index approach from the MapReduce paper, which takes
web pages as input and emits < word, list(sourceURL\position) > value pairs as output.
Although the RWLG requires a different approach, this work still benefits from the
tool-choosing process and shares some of the problems addressed in Coudray’s work, such
as "OutOfMemoryError: Java heap space" and the AWS optimization problem. Herein, we
develop some solutions to increase our program’s input size capability.
CHAPTER 3:
Experimental Design 


The experimental design, from the choice of tools, environment setup, and algorithms to
experiment execution, is described in this chapter, with a focus on the portions that
differ from the previous work. The intent of this chapter is to give a clear and full
understanding of how these experiments are designed and why they are designed this way, 
as well as how these experiments are conducted. 

The organization of this chapter follows the research process: we first chose our tools and 
set up the environment for both local and cloud testing. Then, we designed our algorithms 
and implemented them into computer programs. Finally, we did local testing followed by 
cloud testing. 

3.1 Tools 

The previous chapters mentioned the key tools, such as AWS and the Common Crawl for 
cloud execution, but since launching a cloud execution is more expensive, we usually do a 
local test first. The local test can be run on almost any personal computer if the environment 
is properly configured. 

As a single piece of a much larger NPS project, this study shares the same tool set as 
Coudray’s [2], which is also part of that same project. Nevertheless, there are still some 
differences between our chosen tools and those of Coudray. Some of them are based on 
efficiency considerations, but most of them are just personal preferences. 

Since we must write and pack our programs on our own machine, a handy Integrated 
Development Environment (IDE) like Eclipse is preferred. For local testing, Hadoop must 
be installed on the machine. For cloud execution, AWS must be set up properly. Both this 
and [2] use Eclipse as the IDE, but the way the Java code is packed into a compressed 
Java ARchive (JAR) is different. Both works require Hadoop on the local machine, but the 
install actions are different. Both studies use Common Crawl data for input data sets, but 
the archive chosen as the input source is distinct. Both efforts use AWS for cloud execution, 
but the interfaces chosen to access AWS are different. These differences are all explained
in the following sections. 

3.1.1 Runnable JAR File Instead of Maven Project Packaging 

In the HMR environment, our programs always had at least two separate Java files; thus, 
packaging them into one JAR file was essential. Maven is a powerful tool; it was used 
for packaging JAR files in Coudray’s work [2]. Although it is a good tool, setting it up,
including building the XML file, is complex if the only goal is to package a JAR file. The
current work used a much simpler way to achieve the same functionality without the need to
install any plug-ins beyond the main Eclipse IDE. The steps for using the built-in Eclipse
function are described in later sections.

3.1.2 Native Hadoop Instead of VirtualBox Simulation 

In Coudray’s work, local testing was done by VirtualBox simulation [2]. He simplified 
his computer system this way while sacrificing execution efficiency. In contrast, this work 
gained more benefits by directly installing Hadoop on the host operating system. 

First, deploying a virtual system requires system resources, and the execution is not able to 
run at full system capability. For the lightweight laptop that was used for this research, this 
distinction was important. 

Second, VirtualBox uses more disk space than required. To handle the simulation output, 
we have to pre-allocate more resources for the virtual system, and the machine used cannot 
afford it. 

Last, the file exchange is more complicated when using a virtual system. Without using 
a virtual system, we can pack the JAR file right at the execution directory and carry out 
testing with minimal delay. This is not possible in VirtualBox. 

These considerations caused us to install Hadoop directly on the operating system. The 
installation of native Hadoop is discussed in later sections. 
3.1.3 Newer Version of Common Crawl Files

The files in the Common Crawl are well organized, but the number of WARC files per segment
and the number of segments per archive vary between archives. To be precise, there are
around 560 WARC files in each segment and 100 segments in the first 2017 archive.
Coudray’s work [2] used a smaller-sized segment, whereas the current research
used newer Common Crawl data sets. Although file size and total number of files should not
be a problem in theory for parallel execution, we actually encountered hardware problems 
requiring mitigation. These are addressed in later sections. 

3.1.4 Accessing AWS with CLI 

GUIs are user friendly in most cases, including AWS task configuration. The GUI is
convenient and efficient when the number of jobs is small, but setting up many jobs through
it is time consuming. The CLI is harder to learn because it requires the user to learn
many commands to carry out different functions, but it is much more efficient once the user
has learned these commands. In this research, there are many more jobs to run on different
sets of data and instances than in the previous research; thus, using the CLI is much
more efficient. The environment setup for the CLI and its usage on AWS S3/EMR are discussed
in later sections.


3.2 Environment Setup 

This section contains two parts: local machine setup and AWS configuration. Local machine 
setup covers the installation steps for Hadoop on a clean Mac OS, which can be used to 
write and run the MR jobs locally for basic testing and algorithm verification. AWS 
configuration describes the steps to set up privileges for CLI usage in AWS as well as 
local settings such as account and region information, allowing direct execution of AWS 
CLI commands from the Terminal application in the Mac OS. 

3.2.1 Local Machine Setup 

This section contains instructions to install Hadoop on a clean Mac OS, which requires 
specific software to be pre-installed. These instructions, which integrate ideas from 
various sources [7], [27], [28] with personal customization, aim to minimize the amount 
of human effort. 

3.2.2 XCode 

XCode is provided by Apple in the Mac App Store. It can be freely retrieved and should be 
installed as the very first item as it includes many functions and software that are transparent 
to the user and very useful in later steps. 

3.2.3 Java 

Java is provided by Oracle from its website (Java installation link can be found at 
https://www.java.com/) and can be retrieved at no cost. Java should be installed as the 
second item, since it is required to run Hadoop. 

3.2.4 Eclipse 

Eclipse is an IDE for many coding languages (Eclipse can be found at https://eclipse.org). 
Although Java code can be written using any text editor, an IDE helps greatly for 
formatting and syntax checking as well as for packaging. We simply download and install 
Eclipse through the GUI. There are seven libraries we must include: hadoop-common and 
hadoop-mapreduce-client-core are for Java programming involving Hadoop; jwat-warc, 
jwat-common, and jwat-archive-common are libraries that help the Java programmer to 
access WARC files; warcutils is a library that works with the JWAT libraries to access 
Common Crawl WARC files; finally, the jsoup library is a Java HTML parser. The libraries 
hadoop-common and hadoop-mapreduce-client-core are found at https://mvnrepository.com; 
the libraries jwat-warc, jwat-common, and jwat-archive-common are found at 
https://sbforge.org/display/JWAT/JWAT; the library warcutils is found at 
https://github.com/norvigaward/warcutils; the jsoup library is found at 
https://jsoup.org/download. 

3.2.5 Homebrew 

Homebrew is third-party package manager software for the Mac OS (the Homebrew installation 
guide can be found at http://brew.sh). The advantage of using Homebrew to install software 
is that it identifies and installs all the dependencies, which simplifies the task when the 
user needs to install software under the Mac OS. Homebrew can be installed through a 
single-line Terminal command after XCode is installed: 

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 


3.2.6 Hadoop 

After the above software is successfully installed, we install Hadoop using a single Terminal 
command: 

brew install hadoop 

Hadoop configuration is required for correct operation. Seven files under 
/usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop need modification (the path contains 
2.7.2 because Hadoop 2.7.2 is installed; a different number appears if a different version 
of Hadoop is installed): 

• hadoop-env.sh 

• mapred-env.sh 

• yarn-env.sh 

• core-site.xml 

• hdfs-site.xml 

• mapred-site.xml 

• yarn-site.xml 

The suggested source listings for these seven files are listed in Appendix A. 

Remote login must be enabled to use Hadoop: check the Remote log-in option under 
"System Preferences => Sharing" first and enter the following two commands in Terminal: 

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

Running Hadoop involves the following commands in Terminal: 

hadoop namenode -format 

/usr/local/Cellar/hadoop/2.7.2/libexec/sbin/start-dfs.sh 

/usr/local/Cellar/hadoop/2.7.2/libexec/sbin/start-yarn.sh 

Note that the first command is required only for the first time after installation. To stop 
Hadoop, enter the commands: 

/usr/local/Cellar/hadoop/2.7.2/libexec/sbin/stop-dfs.sh 
/usr/local/Cellar/hadoop/2.7.2/libexec/sbin/stop-yarn.sh 

From personal experience, we can attest that these commands are not required if web 
monitoring is not desired. If only MR is needed, Section 3.4 is all that is required after 
the steps described above are done once. 

3.2.7 AWS Configuration 

The AWS configuration includes server side and client side configurations. Although we 
already have an AWS account at this point, a separate user account, used only for CLI 
access, is required and should be set up on the server side. The client side setup uses 
the information acquired during server side setup. 

Server Side Configuration 

Log into the AWS Console => IAM => Users => Add user. Enter the desired user name 
and check the "Programmatic access" option. The user should have privileges to access S3 
and EMR; administrator privileges also work. Remember to save the created access key 
for client side configuration. 

Client Side Configuration 

Follow the AWS CLI installation guide (the AWS CLI installation guide can be found 
at http://docs.aws.amazon.com/cli/latest/userguide/installing.html), or enter the following 
command in Terminal: 

brew install awscli 

Configure the AWS CLI by following the AWS CLI configuration tutorial (the AWS CLI 
configuration tutorial can be found at 
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html), or enter 
the following command in Terminal: 

aws configure 

Enter the Access Key information retrieved from Section 3.2.7, and enter us-east-1 for 
Default region name. Skip the Default output format by pressing enter without entering 
anything. 

3.3 RWLG 

The original idea for RWLG in the MR paper from Google [3] includes the following 
operations: 

1. For each web page, record its URL as sourceURL. 

2. For the same web page, record its link URL as targetURL[n], for n different links. 

3. The mapper’s output is <targetURL, sourceURL> for each link. 

4. The reducer gathers all mapper output, rearranges it with targetURL as the key, and 
produces the following output value pairs: <targetURL, list(sourceURL)>. 

3.3.1 Algorithm 3-1: Basic Algorithm 

We can easily implement the basic algorithm from the original idea just described to process 
WARC files from the Common Crawl with the help of the Jsoup library. The algorithm is 
as follows with the detailed Java code source listings in Appendix B: 

1. Fact: for every WARC record, there exists only one sourceURL in its header. 

2. For every WARC record containing HTML content, the mapper emits value pairs 
<targetURL, sourceURL> for every targetURL found in this specific HTML content. 

3. Reducers gather the value pairs by key (targetURL) and generate new value pairs 
<targetURL, list(sourceURL)> for different keys. 
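
As a minimal sketch of the map and reduce steps above, the following plain-Java program simulates them in memory, without Hadoop, JWAT, or jsoup. The class name ReverseWebLinkSketch, its method names, and the example URLs are illustrative assumptions and are not taken from Appendix B; each page is represented simply as a source URL with the links already extracted from it. A sorted set is used in place of a list so that the printed result is deterministic, which is a choice of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of the basic RWLG idea; not the Hadoop job from Appendix B. */
final class ReverseWebLinkSketch {

    /** Map step: one page yields a {targetURL, sourceURL} pair for every link found on it. */
    static List<String[]> map(String sourceUrl, List<String> linksOnPage) {
        List<String[]> pairs = new ArrayList<>();
        for (String targetUrl : linksOnPage) {
            pairs.add(new String[] {targetUrl, sourceUrl});
        }
        return pairs;
    }

    /** Reduce step: group the pairs by targetURL, giving targetURL -> sorted sourceURLs. */
    static Map<String, TreeSet<String>> reduce(List<String[]> pairs) {
        Map<String, TreeSet<String>> reversed = new TreeMap<>();
        for (String[] pair : pairs) {
            reversed.computeIfAbsent(pair[0], key -> new TreeSet<>()).add(pair[1]);
        }
        return reversed;
    }

    /** Runs both steps over a small crawl given as sourceURL -> links found on that page. */
    static Map<String, TreeSet<String>> run(Map<String, List<String>> crawl) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> page : crawl.entrySet()) {
            pairs.addAll(map(page.getKey(), page.getValue()));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, List<String>> crawl = new TreeMap<>();
        crawl.put("http://a.example/", List.of("http://c.example/"));
        crawl.put("http://b.example/", List.of("http://c.example/", "http://a.example/"));
        // Prints each target followed by the pages that link to it.
        run(crawl).forEach((target, sources) -> System.out.println(target + "\t" + sources));
    }
}
```

In the real job the grouping is done by the MR shuffle across many reducers; here a single TreeMap plays that role.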

3.3.2 Algorithm 3-2: Algorithm for Far Links 

From the result of the basic algorithm, we created an extended algorithm to find link pairs 
that are two hops away or more. For example, link pair (A, C) is a two-hop link in which 
page C has a link to an intermediate page B, and page B has a link to page A. The algorithm 
follows, with the detailed Java code source listing in Appendix C: 

1. Reverse the output from the basic algorithm back to <targetURL, sourceURL> 
value pairs instead of lists. 

2. For every <targetURL, sourceURL> value pair, the mapper emits 
<sourceURL, <targetURL, sourceURL>> and <targetURL, <targetURL, sourceURL>>. 
Since self-links cause problems when it comes to far links, we do not generate 
pairs for any pair that points to itself; that is, <targetURL, sourceURL> where 
targetURL = sourceURL. 

3. By the nature of MR, (key, value) pairs with the same key are gathered by the 
same reducer; thus, if two web pages have the same middle URL, the received 
<key, <targetURL, sourceURL>> has the form <middleURL, <targetURL, middleURL>> 
and <middleURL, <middleURL, sourceURL>>. We create two sets named srcRequester 
and srcProvider, which stand for source requester and source provider, respectively. 
We then extract targetURL into the srcRequester set from value pairs whose key is 
the same as sourceURL and extract sourceURL into the srcProvider set from value 
pairs whose key is the same as targetURL. 

4. When both sets are complete, the reducer generates its output value pairs 
<targetURL, sourceURL> by iterating through every element of both the srcRequester 
and srcProvider sets. The output represents links that are two hops away (or more 
if srcRequester is collected from links that are more than one hop away). We want 
to exclude the links that point to themselves from the output in this step as well; 
that is, exclude <targetURL, sourceURL> pairs that have targetURL = sourceURL. 

5. Since middleURL is the key, these output values are not ordered by targetURL and 
they may be repeated in other reducers. To reorder the key and reduce redundant data, 
we pass the result to another MR job, generating value pairs 
<targetURL, list(sourceURL)> as final output. 
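
The five steps above can be sketched in memory as follows. This is an illustrative simulation under stated assumptions, not the Hadoop code from Appendix C: the class name FarLinkSketch and its method names are hypothetical, pairs are held as String[] {targetURL, sourceURL}, and a TreeMap stands in for the MR shuffle that delivers all pairs with the same key to one reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of Algorithm 3-2; pairs are String[] {targetURL, sourceURL}. */
final class FarLinkSketch {

    /** Step 2: key each non-self pair by its sourceURL and by its targetURL. */
    static Map<String, List<String[]>> mapStep(List<String[]> oneHopPairs) {
        Map<String, List<String[]>> byKey = new TreeMap<>();
        for (String[] pair : oneHopPairs) {
            if (pair[0].equals(pair[1])) {
                continue; // self-links are not emitted
            }
            byKey.computeIfAbsent(pair[1], key -> new ArrayList<>()).add(pair);
            byKey.computeIfAbsent(pair[0], key -> new ArrayList<>()).add(pair);
        }
        return byKey;
    }

    /** Steps 3 and 4: for one middle URL, combine every requester with every provider. */
    static List<String[]> reduceStep(String middleUrl, List<String[]> pairs) {
        TreeSet<String> srcRequester = new TreeSet<>(); // targets of pairs whose source is the key
        TreeSet<String> srcProvider = new TreeSet<>();  // sources of pairs whose target is the key
        for (String[] pair : pairs) {
            if (pair[1].equals(middleUrl)) {
                srcRequester.add(pair[0]);
            }
            if (pair[0].equals(middleUrl)) {
                srcProvider.add(pair[1]);
            }
        }
        List<String[]> twoHopPairs = new ArrayList<>();
        for (String target : srcRequester) {
            for (String source : srcProvider) {
                if (!target.equals(source)) { // drop pairs that point to themselves
                    twoHopPairs.add(new String[] {target, source});
                }
            }
        }
        return twoHopPairs;
    }

    /** Step 5: a second pass regroups the two-hop pairs by targetURL and removes duplicates. */
    static Map<String, TreeSet<String>> twoHopLinks(List<String[]> oneHopPairs) {
        Map<String, TreeSet<String>> byTarget = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> group : mapStep(oneHopPairs).entrySet()) {
            for (String[] pair : reduceStep(group.getKey(), group.getValue())) {
                byTarget.computeIfAbsent(pair[0], key -> new TreeSet<>()).add(pair[1]);
            }
        }
        return byTarget;
    }

    public static void main(String[] args) {
        List<String[]> oneHop = new ArrayList<>();
        oneHop.add(new String[] {"A", "B"}); // B links to A
        oneHop.add(new String[] {"B", "C"}); // C links to B
        oneHop.add(new String[] {"B", "D"}); // D links to B
        oneHop.add(new String[] {"A", "A"}); // self-link, ignored
        System.out.println(twoHopLinks(oneHop)); // prints {A=[C, D]}
    }
}
```

In the example, only the key B has both a requester (A) and providers (C and D), so it is the only key that produces output.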

The illustration in Figure 3.1 helps us understand the mechanism in the first part of this 
algorithm. We see that a two-hop link between the target and the source is connected by the 
middle. It is natural to use the middle as the key in the MR job since it participates in both 
sides. The mapper takes the previous link pairs as input and emits the output value pairs of 
form < key,< target, source >>. A reducer handling key=mid collects the corresponding 
value pairs, and the reducer easily extracts values from these pairs into either srcRequester 
or srcProvider sets. Once we complete the srcRequester and srcProvider, we can 
generate the two-hop-link value pairs and sort them with the target as the key through 
another MR job. 



[Diagram labels: <tar, mid>, <mid, src>] 

Figure 3.1. Illustrated Internal Process for Algorithm 3-2 


This illustration implies that if we use n-hop link pairs on the left side and one-hop link 
pairs on the right side, we get srcRequester and srcProvider sets whose elements are n+1 
hops apart; thus, any far links can be generated from one-hop links through this algorithm. 
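
A compact way to see this extension is to treat the step as a join between n-hop pairs and one-hop pairs on the middle URL. The sketch below is a plain in-memory hash join rather than the MR job itself; the class name HopExtensionSketch and the method extend are hypothetical, and pairs are written as two-element lists {target, source}.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: join n-hop pairs {tar, mid} with one-hop pairs {mid, src} on the middle URL. */
final class HopExtensionSketch {

    /** Returns the (n+1)-hop pairs {tar, src} obtained from the two inputs. */
    static Set<List<String>> extend(Set<List<String>> nHopPairs, Set<List<String>> oneHopPairs) {
        // Index the one-hop side by its target, which plays the role of the middle URL.
        Map<String, Set<String>> sourcesByMiddle = new HashMap<>();
        for (List<String> pair : oneHopPairs) {
            if (!pair.get(0).equals(pair.get(1))) { // skip self-links
                sourcesByMiddle.computeIfAbsent(pair.get(0), key -> new HashSet<>())
                        .add(pair.get(1));
            }
        }
        Set<List<String>> result = new HashSet<>();
        for (List<String> pair : nHopPairs) {
            String target = pair.get(0);
            String middle = pair.get(1);
            for (String source : sourcesByMiddle.getOrDefault(middle, Set.of())) {
                if (!source.equals(target)) { // exclude pairs that point to themselves
                    result.add(List.of(target, source));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Chain D -> C -> B -> A, written as {target, source} pairs.
        Set<List<String>> oneHop =
                Set.of(List.of("A", "B"), List.of("B", "C"), List.of("C", "D"));
        Set<List<String>> twoHop = extend(oneHop, oneHop);
        Set<List<String>> threeHop = extend(twoHop, oneHop);
        System.out.println("two-hop:   " + twoHop);
        System.out.println("three-hop: " + threeHop);
    }
}
```

Applying extend repeatedly to its own output yields pairs connected by progressively longer paths. A pair in the result records that a path of that length exists; it does not rule out a shorter path between the same two pages.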


3.3.3 Proof and Validation for Algorithm 3-2 

The algorithm for far links described in the previous section claims both that for every 
resulting pair <a, c>, the target a is always two hops away from the source c, and that 
for any pair <a, c> that does not belong to the resulting pairs, target a is never two 
hops away from the source c. To prove the algorithm is working, both analytical and 
experimental approaches are provided in this section. For the analytical proof, the 
algorithm is proved through a direct proof for the "if" part and a proof by contradiction 
for the "only if" part; for the experimental validation, a record of 66 value pairs was 
created for testing. 

Analytical Proof 

Define: $b \to a$ means $b$ has a link to $a$. 

Define: $b \to\to a$ means $b$ is a sourceURL two hops away from $a$. 

Define: $I_1$ = all $\langle a, b \rangle$ value pairs from Step 2 of Algorithm 3-1. 

Define: $I_2$ = all $\langle a, b \rangle$ value pairs from Step 4 of Algorithm 3-2. 

Proof: $c \to\to a$ iff $\exists b$ s.t. $b \to a$ and $c \to b$ and $a, b, c$ are all distinct. 

Fact: $a \neq b$ and $b \neq c$ from Step 2 of Algorithm 3-2; $a \neq c$ from Step 4 of Algorithm 3-2. 

1. Direct proof. $\forall \langle a, c \rangle \in I_2 \Rightarrow c \to\to a$: 

$\langle a, c \rangle \in I_2 \Rightarrow \langle a, c \rangle \in$ the output of Step 4 of Algorithm 3-2. 

$\Rightarrow \langle a, c \rangle$ is generated from Step 2 of Algorithm 3-2. 

$\Rightarrow \exists b$, which is a middleURL, s.t. $c \to b$ and $b \to a$ and $a, b, c$ are all distinct 
(from Fact). 

2. Proof by contradiction. $\forall \langle a, c \rangle \notin I_2 \Rightarrow c \to\to a$ is not true. 

(a) Assume $\nexists b$ s.t. $b \to a$ and $c \to b$. 

$\Rightarrow c \to\to a$ is not true since there is no middleURL between $a$ and $c$. 

(b) Assume $\exists b$ s.t. $b \to a$, $c \to b$, and $\langle a, c \rangle \notin I_2$. Then 
$\langle a, b \rangle \in I_1$ and $\langle b, c \rangle \in I_1$ since $I_1$ has all one-hop links. 

$\Rightarrow$ Step 2 of Algorithm 3-2 generates $\langle b, \langle a, b \rangle \rangle$ and 
$\langle b, \langle b, c \rangle \rangle$ as output. 

$\Rightarrow$ Step 3 of Algorithm 3-2 gathers these value pairs and emits $\langle a, c \rangle$ in Step 4; 
this means $\langle a, c \rangle \in I_2$, which is a contradiction. 
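
On small inputs, the statement being proved can also be checked by brute force, independently of any MR code. The sketch below enumerates every triple (a, b, c) and applies the definition directly. The class name TwoHopOracle is hypothetical, the cubic loop is meant only for test-sized data, and pairs are two-element lists {target, source}.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Brute-force check of the two-hop definition; intended for small test inputs only. */
final class TwoHopOracle {

    /** Returns every {a, c} for which some b exists with b -> a, c -> b, and a, b, c distinct. */
    static Set<List<String>> twoHopByDefinition(Set<List<String>> oneHop) {
        // Collect every URL that appears in any {target, source} pair.
        Set<String> urls = new TreeSet<>();
        for (List<String> pair : oneHop) {
            urls.addAll(pair);
        }
        Set<List<String>> result = new HashSet<>();
        for (String a : urls) {
            for (String b : urls) {
                for (String c : urls) {
                    boolean distinct = !a.equals(b) && !b.equals(c) && !a.equals(c);
                    // {a, b} means b -> a; {b, c} means c -> b.
                    if (distinct
                            && oneHop.contains(List.of(a, b))
                            && oneHop.contains(List.of(b, c))) {
                        result.add(List.of(a, c));
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two different middle pages (B and D) connect C to A; the pair is reported once.
        Set<List<String>> oneHop = Set.of(
                List.of("A", "B"), List.of("B", "C"),
                List.of("A", "D"), List.of("D", "C"),
                List.of("A", "A"));
        System.out.println(twoHopByDefinition(oneHop)); // prints [[A, C]]
    }
}
```

Comparing this set with the output of an implementation of Algorithm 3-2 on the same input is one way to exercise both directions of the "iff".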

Experimental Validation 

A file with 66 value pair records was manually created. These records include 1-to-1 
pairs, 1-to-2 pairs, 2-to-1 pairs, 1-to-3 pairs, 3-to-1 pairs, and extra 1-to-1 pairs that cause 
duplicate paths. This test file was processed by an MR job utilizing Algorithm 3-2, and the 
resulting output showed the algorithm to be correct. The algorithm caught all valid pairs 
and discarded the replicated value pairs. Samples of these cases are shown 
in Figure 3.2. Both the test input and output files are listed in Appendix D. 

Figure 3.2. Results for the Experimental Validation 


3.4 Experiment Execution 

This section describes how the testing and execution were done on both the local machine 
and AWS. Note that these instructions work only after the steps described in Section 3.2 
are correctly carried out. 

3.4.1 Local Machine Execution 

The steps for JAR file packaging and MR execution from a local machine are described in 
this section. 

JAR File Packaging in Eclipse 

After selecting the working project in Eclipse, navigate "File" => "Export." 

Choose "Runnable JAR file" and click "Next >." 


Choose the MR driver class file for "Launch configuration" and enter the desired directory 
for "Export destination." Select "Extract required libraries into generated JAR" for "Library 
handling" and click "Finish." 

There usually are many driver class files in a JAR file, and setting "Launch configuration" 
means we cannot use other drivers. To solve this problem, open a Terminal window and 
navigate to the directory containing the JAR file, then enter: 

vim [JAR file name] 

:1 

Move the cursor to the line indicating META-INF/MANIFEST.MF and press enter. Press 
"a" to enter insert mode, and remove the line starting with Main-Class. Press Esc and enter: 

:wq 

:q 

The change is now saved, and we can freely choose the driver in the Hadoop command line. 

MR Execution 

Enter the following command under the same directory with the packaged JAR file in 
Terminal: 

hadoop jar [JAR filename] [Driver class name] [input path] [output path] 

For example, assume the JAR file is under /Users/Tao/MapReduce, the input file is under 
/Users/Tao/MapReduce/input, and the expected output directory is 
/Users/Tao/MapReduce/output. Move first to the same directory by 

cd /Users/Tao/MapReduce 

Assume the JAR filename is WebLink.jar and the Driver class name is LinkDriver and enter 
the following command to carry out the MR job: 

hadoop jar WebLink.jar LinkDriver input output 

3.4.2 AWS CLI Execution 

Detailed documentation for the AWS CLI can always be found on the AWS website (full 
CLI documentation can be found at https://aws.amazon.com/cli/). The instructions here 
describe only the commands that are used in this study. 

Upload JAR File to AWS S3 

To upload the JAR file, we first create a bucket using the following command: 

aws s3api create-bucket --bucket [Bucket name] 

Use the following command to upload the JAR file: 

aws s3 cp [JAR filename] s3://[Bucket name]/ 

Assume the unused desired bucket name is "mapreducejob" and the JAR file name is 
"WebLink.jar." The commands required are: 

aws s3api create-bucket --bucket mapreducejob 
aws s3 cp WebLink.jar s3://mapreducejob/ 


Delete Resulting Output Files from AWS S3 

We mentioned previously that we are charged daily for the output files based on their size; 
therefore, we prefer to delete any unneeded output files. The S3 GUI gets stuck when trying 
to remove a significant number of files, so we use the CLI to carry out the remove operation 
instead. The command to delete a folder and all its content in an S3 bucket is the following: 

aws s3 rm --recursive s3://[Bucket name]/[Folder Path]/ 

Assume the bucket name is "mapreducejob" and the folder is named "output" under the 
mapreducejob bucket. Then, the command becomes 

aws s3 rm --recursive s3://mapreducejob/output/ 


Execute MR job using AWS EMR 

There are many options in the AWS CLI for EMR, but for this research we only need to 
create a cluster and add steps. The most general command used is 

aws emr create-cluster --steps [Step options] --release-label [EMR version] --instance-groups [Instance options] --auto-terminate --enable-debugging --log-uri [Log path] --name [Cluster name] 

Step options are 

Type=CUSTOM_JAR,Name=[Step name],ActionOnFailure=TERMINATE_CLUSTER,Jar=[JAR file path],Args=[Arguments] 

where the arguments in our research are [Driver class name],[Input path],[Output path]; the 
up-to-date EMR version at the time of this research is emr-5.2.0; the instance options are: 

InstanceGroupType=MASTER,InstanceCount=1,InstanceType=[Master instance type] InstanceGroupType=CORE,InstanceCount=[Core instance count],InstanceType=[Core instance type] 

where the master and core instance types are listed in the AWS listing (available AWS EMR 
instance types can be found at https://aws.amazon.com/emr/pricing/). The core instance 
count is the number of core instances wanted in this cluster. 

One typical, often used command is: 

aws emr create-cluster --steps Type=CUSTOM_JAR,Name=examplestep,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://mapreducejob/WebLink.jar,Args=LinkDriver,s3://commoncrawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/,s3://mapreduce/output/exampleoutput/ --release-label emr-5.2.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.xlarge InstanceGroupType=CORE,InstanceCount=19,InstanceType=c3.xlarge --auto-terminate --enable-debugging --log-uri 's3n://aws-logs-270560560283-us-east-1/elasticmapreduce/' --name 'examplecluster' 
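
Because many jobs differ only in a few values (instance type, core count, input and output paths), the command can be assembled programmatically. The sketch below only builds and prints the argument list using the options shown above; it does not call AWS. The class name EmrCommandSketch, the method name, the parameter names, and the bucket name examplelogs are illustrative assumptions, and the same instance type is assumed for the master and core groups.

```java
import java.util.List;

/** Sketch: assemble the create-cluster arguments described in the text (nothing is executed). */
final class EmrCommandSketch {

    static List<String> createClusterCommand(String stepName, String jarPath, String driverClass,
            String inputPath, String outputPath, String instanceType, int coreCount,
            String logUri, String clusterName) {
        // Step options: a custom JAR step whose arguments are driver class, input, and output.
        String step = "Type=CUSTOM_JAR,Name=" + stepName
                + ",ActionOnFailure=TERMINATE_CLUSTER,Jar=" + jarPath
                + ",Args=" + driverClass + "," + inputPath + "," + outputPath;
        // Instance options: one master instance plus coreCount core instances.
        String master = "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=" + instanceType;
        String core = "InstanceGroupType=CORE,InstanceCount=" + coreCount
                + ",InstanceType=" + instanceType;
        return List.of("aws", "emr", "create-cluster",
                "--steps", step,
                "--release-label", "emr-5.2.0",
                "--instance-groups", master, core,
                "--auto-terminate", "--enable-debugging",
                "--log-uri", logUri,
                "--name", clusterName);
    }

    public static void main(String[] args) {
        List<String> command = createClusterCommand("examplestep",
                "s3://mapreducejob/WebLink.jar", "LinkDriver",
                "s3://mapreducejob/input/", "s3://mapreducejob/output/exampleoutput/",
                "c3.xlarge", 19, "s3n://examplelogs/elasticmapreduce/", "examplecluster");
        System.out.println(String.join(" ", command));
    }
}
```

Keeping the arguments as a list rather than one string avoids quoting problems if the list is later handed to a process launcher.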

CHAPTER 4: 
Performance Results 


We conducted the executions with our implemented algorithms using the methodology 
described in the previous chapter, and the results are presented in this chapter in two parts. 
The first part contains the execution results, from local execution on a single WARC file to 
AWS execution in the cloud on one of the latest segments in 2017. The second part covers 
the results of a series of attempts to finish one full archive in one AWS execution. 


4.1 Experimenting with a Single Segment 

In the Coudray thesis [2], success was defined based upon cloud processing of a single 
Common Crawl segment. We followed the same path to claim success on RWLG and 
delved into a larger data set from that point. We began mining a single WARC file with 
local execution and built towards mining a full Common Crawl segment. 

4.1.1 Local Execution with One WARC File 

As described in Chapter 3, the algorithms were shown to be correct through both 
mathematical analysis and a test case. We wanted to know how the algorithms worked when 
dealing with real-world data sets. Constrained by the computing power of the local machine, 
we used only one WARC file 
(CC-MAIN-20170116095119-00000-ip-10-171-10-70.ec2.internal.warc.gz) from the Common 
Crawl’s most recent archive (CC-MAIN-2017-04) as the input. 

The local executions ran on an early-2015 13-inch MacBook Air with OSX El Capitan 
version 10.11.16 and Hadoop version 2.7.2. The laptop was equipped with a 2.2-GHz 
Intel Core i7 Processor, 8-GB 1600-MHz DDR3 memory, and an Intel HD Graphics 6000 
1536-MB graphic card. The results were separated by first-hop links and second-hop links, 
with the no-dynamic option combined and compared. 

First-Hop Links 

We ran the execution using the basic algorithm to obtain the first-hop links. The runs were 
done five times for each option and finished without error. The average resulting metrics 
are listed in Table 4.1. The result for the no-dynamic option finished faster and was smaller 
in size, which was expected. 


Table 4.1. Local Execution Result Metrics for First-Hop Links 

Option     | Execution Time | Input Size | Output Size 
Normal     | 6 min 4 sec    | 1.02 GB    | 831.8 MB 
No-dynamic | 5 min 54 sec   | 1.02 GB    | 637 MB 


Second-Hop Links 

We ran the execution using the algorithm for far links to obtain the second-hop links. We 
noted that there was no "no-dynamic" option in this algorithm. Since we used the first-hop 
links as input, we had no-dynamic second-hop links if the no-dynamic first-hop links were 
given. The runs were also done five times each and finished without error. The average 
resulting metrics are listed in Table 4.2. 


Table 4.2. Local Execution Result Metrics for Second-Hop Links 

Input      | Execution Time | Input Size | Output Size 
Normal     | 2 min 17 sec   | 832.2 MB   | 8.7 MB 
No-dynamic | 1 min 46 sec   | 637.4 MB   | 41.9 MB 


The result with no-dynamic input finished faster because the input size was smaller. Since 
many dynamic URLs were treated as no-dynamic ones, more links became more relevant; 
thus, the output size for the no-dynamic option became larger as expected. 


4.1.2 AWS Execution with One WARC File 

The algorithms were shown to be working with a mathematical analysis, test case, and local 
testing. Because it is very expensive to own and maintain a cluster capable of processing 
tens of terabytes of input data, we employed AWS EMR. We wanted to know how the 
algorithms worked in the cloud environment using AWS EMR and compared the result 
with the previous local testing before we launched our job with large data sets. To see 
how the same task could be carried out in AWS, we executed the algorithms on the same 
WARC file (CC-MAIN-20170116095119-00000-ip-10-171-10-70.ec2.internal.warc.gz) as 
the local execution using only one EC2 instance. 

The EC2 instance used here was of the type c3.4xlarge, and the reason we chose this type of 
instance is described in Section 4.2.2. The c3.4xlarge instance has 16 virtual CPUs using 
Intel Xeon E5-2680 v2 (Ivy Bridge) Processors and was equipped with 30-GB memory and 
two 160-GB SSDs. 


First-Hop Links 

We repeated mining a single WARC file for first-hop links, but this time used cloud com¬ 
puting. The runs finished without error, and the average resulting metrics are listed in 
Table 4.3. 

Table 4.3. AWS Execution Result Metrics on One WARC File for First-Hop Links with One 
Instance 

Option     | Execution Time | Input Size | Output Size 
Normal     | 6 min          | 1.02 GB    | 825.3 MB 
No-dynamic | 5 min          | 1.02 GB    | 632.1 MB 


The result with the normal option appeared to be 6.5 MB smaller in size than the 
result in local execution. Since the default number of reducers in the AWS for this run was 
seven, we were not able to directly compare the output files. To do so, we downloaded the 
first-hop-link result from AWS and ran a sorting MR job on the results both from AWS and 
local execution. We used the diff command to compare the two output files, and found 
only two key values in millions of records that had differences in a few bytes. We could 
not understand what caused the difference since we were using the same JAR to run from 
the identical input, but the difference should be around 10 bytes instead of 6.5 MB. We 
concluded the output of the AWS execution was consistent with the local execution, and the 
difference in size might be due to the various sizing mechanisms of different file systems. 
The source listing for the sorting MR job is listed in Appendix E. We also noticed that the 
execution time for a single c3.4xlarge instance and our local execution were similar. 
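
The comparison idea (make the two outputs independent of how records were split across reducers, then look for differences) can be sketched without MR as follows. This in-memory version is a stand-in for the sorting MR job in Appendix E and the diff command, not a reproduction of them; the class name OutputCompareSketch and its methods are hypothetical, and each output record is treated as an opaque text line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: compare two runs whose records are spread over different numbers of part files. */
final class OutputCompareSketch {

    /** Merges the lines of all part files from one run and sorts them. */
    static List<String> sortedLines(List<List<String>> partFiles) {
        List<String> all = new ArrayList<>();
        for (List<String> part : partFiles) {
            all.addAll(part);
        }
        Collections.sort(all);
        return all;
    }

    /** Returns the lines of left that have no matching line in right (duplicates counted). */
    static List<String> onlyIn(List<String> left, List<String> right) {
        Map<String, Integer> unmatched = new HashMap<>();
        for (String line : right) {
            unmatched.merge(line, 1, Integer::sum);
        }
        List<String> missing = new ArrayList<>();
        for (String line : left) {
            int count = unmatched.getOrDefault(line, 0);
            if (count > 0) {
                unmatched.put(line, count - 1); // consume one matching line
            } else {
                missing.add(line);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // One local part file versus two part files from a run with more reducers.
        List<String> local = sortedLines(List.of(List.of("a\t1", "b\t2", "c\t3")));
        List<String> cloud = sortedLines(List.of(List.of("c\t3"), List.of("a\t1", "b\t9")));
        System.out.println("only in local: " + onlyIn(local, cloud)); // the b record differs
        System.out.println("only in cloud: " + onlyIn(cloud, local));
    }
}
```

For outputs of hundreds of megabytes this would need to be replaced by an external sort, which is what the sorting MR job provides.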

Second-Hop Links 

We repeated mining the first-hop results to produce the second-hop results, but this time 
using cloud computing. The runs finished without error, and the average resulting metrics 
are listed in Table 4.4. 

Table 4.4. AWS Execution Result Metrics on One WARC File for Second-Hop Links with 
One Instance 

Input      | Execution Time | Input Size | Output Size 
Normal     | 1 min          | 825.3 MB   | 8.6 MB 
No-dynamic | 1 min          | 632.1 MB   | 41.6 MB 


We did the same comparison for the second-hop-link results as described previously, and 
we got a similar result after sorting. The time needed to run these AWS executions was 
shorter than for local execution because the c3.4xlarge instance had 16 virtual CPUs and 
30-GB memory. Since we had six output files from the first-hop-link result, the c3.4xlarge 
instance ran more mappers at a time, achieving a faster average execution time than the 
local machine. 

4.1.3 Execution on a Segment 

We now knew that our code worked on a local machine as well as in the cloud environment. 
In the following sections, we explore how to process one entire archive at once, which 
consists of hundreds of segments and thousands of WARC files. We started executing on 
a segment (s3://commoncrawl/crawl-data/CC-MAIN-2017-04/) from the latest Common 
Crawl archive using 20 EC2 instances. We originally used m2.xlarge instances but later 
switched to c3.4xlarge for our work since the latter had better cost-performance. We describe 
the steps to determine the cost-performance between different instances in Section 4.2.2. 
The results are presented in the following discussion. 

First-Hop Links 

We executed Algorithm 3-1 for first-hop links on AWS with 20 c3.4xlarge instances. We 
only ran it once for each option. The runs finished without error, and the resulting metrics 
are listed in Table 4.5. 

Table 4.5. AWS Execution Result Metrics on One Segment for First-Hop Links with 20 
Instances 

Option     | Execution Time | Input Size | Output Size 
Normal     | 31 min         | 593.9 GB   | 363.3 GB 
No-dynamic | 31 min         | 593.9 GB   | 258.3 GB 


The execution times for the normal and the no-dynamic options were the same in the 
one-segment execution, as they had been similar in the previous executions. The resulting 
output size for the normal execution was 61.2% of the original input size, down from 79% 
for a single WARC file. The same trend was found for the no-dynamic execution: down to 
43.5% from 60.5% of the input size. Since different web pages could have links to the ones 
already processed, the ratio change in output size was expected. 
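
The output-to-input ratios can be recomputed directly from Tables 4.3 and 4.5. The short sketch below does so; the class name OutputRatioSketch is hypothetical, and the conversion 1 GB = 1024 MB is an assumption chosen because it reproduces the 79% figure for a single WARC file.

```java
import java.util.Locale;

/** Sketch: output size as a percentage of input size, using values from Tables 4.3 and 4.5. */
final class OutputRatioSketch {

    static double percent(double outputSize, double inputSize) {
        return 100.0 * outputSize / inputSize;
    }

    public static void main(String[] args) {
        double warcInputMb = 1.02 * 1024; // assumption: 1 GB = 1024 MB
        System.out.println(String.format(Locale.ROOT,
                "one WARC, normal:        %.1f%%", percent(825.3, warcInputMb)));  // 79.0%
        System.out.println(String.format(Locale.ROOT,
                "one WARC, no-dynamic:    %.1f%%", percent(632.1, warcInputMb)));  // 60.5%
        System.out.println(String.format(Locale.ROOT,
                "one segment, normal:     %.1f%%", percent(363.3, 593.9)));        // 61.2%
        System.out.println(String.format(Locale.ROOT,
                "one segment, no-dynamic: %.1f%%", percent(258.3, 593.9)));        // 43.5%
    }
}
```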


Second-Hop Links 

We then successfully mined the first-hop results for a full Common Crawl segment to 
produce the second-hop results using cloud computing; however, we were not able to 
finish the two-hop-link execution with the no-dynamic option after 14 tries with different 
configurations. The results are shown in Table 4.6. 

Table 4.6. AWS Execution Result Metrics on One Segment for Second-Hop Links with 20 
Instances 

Input      | Execution Time         | Input Size | Output Size 
Normal     | 2 hr 54 min            | 363.3 GB   | 368.8 GB 
No-dynamic | failed between 2-3 hrs | 258.3 GB   | - 


The apparently longer execution time is not investigated until Section 4.2.3. To further 
explore the failure issue, we reduced the input from a segment to multiple WARC files and 
compared the resulting size between different numbers of WARC files as shown in Table 4.7. 

Table 4.7. AWS Execution Output Size Comparison between Different Numbers of WARC 
Files with 20 Instances 

Input File | First-Hop Execution Time | First-Hop Output Size | Second-Hop Execution Time | Second-Hop Output Size 
5 WARCs    | 6 min                    | 2.8 GB                | 2 min                     | 392.5 MB 
10 WARCs   | 6 min                    | 5.4 GB                | 2 min                     | 1.1 GB 
20 WARCs   | 7 min                    | 10.2 GB               | 3 min                     | 3.2 GB 
40 WARCs   | 10 min                   | 19.6 GB               | 6 min                     | 9.4 GB 
80 WARCs   | 15 min                   | 38.1 GB               | 14 min                    | 29.1 GB 

We could clearly see that the size of the one-hop-link output grew almost linearly as the 
input size increased. In comparison, the two-hop-link output grew much faster. Since we 
had about 560 WARC files in one segment, it was reasonable to expect more than 500 GB 
for the output of a segment. We know the total memory used for processing this data is 
600 GB distributed among 20 c3.4xlarge instances, which also must allocate memory to the 
operating system and background programs. On the other hand, the output size for a 
second-hop-link execution on one segment was already twice the size of a first-hop-link 
output. Since far-distanced pairs could always be generated from first-hop links, we 
recommend computing them only on specified pages when requested in order to save 
storage space. 
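
The difference in growth can be made concrete by computing how much each output grows every time the number of WARC files doubles in Table 4.7. The sketch below is illustrative only; the class name GrowthSketch is hypothetical, and the 392.5 MB entry is converted with 1 GB = 1024 MB as an assumption.

```java
import java.util.Arrays;

/** Sketch: growth factor of the output size for each doubling of input in Table 4.7. */
final class GrowthSketch {

    /** Returns sizes[i] / sizes[i - 1] for each consecutive pair of entries. */
    static double[] doublingFactors(double[] sizes) {
        double[] factors = new double[sizes.length - 1];
        for (int i = 1; i < sizes.length; i++) {
            factors[i - 1] = sizes[i] / sizes[i - 1];
        }
        return factors;
    }

    public static void main(String[] args) {
        // Output sizes in GB for 5, 10, 20, 40, and 80 WARC files.
        double[] firstHopGb = {2.8, 5.4, 10.2, 19.6, 38.1};
        double[] secondHopGb = {392.5 / 1024, 1.1, 3.2, 9.4, 29.1}; // assumption: 1 GB = 1024 MB
        System.out.println("first-hop:  " + Arrays.toString(doublingFactors(firstHopGb)));
        System.out.println("second-hop: " + Arrays.toString(doublingFactors(secondHopGb)));
    }
}
```

The first-hop factors stay just below 2, which matches near-linear growth, while the second-hop factors are close to 3 and increasing.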


4.2 Scaling AWS Configurations for Many Segments 

From the executions in the previous sections, we established that our code worked and 
was capable of distributed processing and parallel computing in the cloud environment; 
however, as pointed out in the Coudray thesis, this type of work would encounter the "out of 
memory" error at some point [2]. We do not want our work to be limited by the size of the 
data set, as the ultimate goal is to process one entire archive at once. Herein, we describe 
our effort to achieve this goal, step by step. 

4.2.1 Challenges 

If we need to process a large data set, we do not want to manually assign the separated data 
sets and aggregate their results on demand. An archive in the Common Crawl is an image 
of the Internet at a specific time, and we desire to automatically process one full archive at 
once. We expected to achieve this goal by parallel computing, but we encountered an "out 
of memory" error. 

AWS provided various types of instances to implement our jobs, but it seemed that 
AWS did not provide any auto-scaling ability that could adapt to the different jobs assigned. 
Using the default environment setting for 20 m2.xlarge instances, we succeeded in the 
execution of three 2016 Common Crawl segments with a total of about 1 TB as input at a 
time without error, but we could not process any more segments beyond that point. 

We thought AWS would provide almost infinite hardware resources (e.g., CPU, disks, 
and memory) upon our request, which should have been enough to run as many segments as 
we wanted. The truth was that, depending on the algorithm we used in MR, we encountered 
the "out of memory" error when processing a very large data set. We did not encounter the 
"out of memory" error until we ran the execution on four segments using 20 m2.xlarge 
instances. The error might also have been related to inefficient resource usage that led to 
poor execution efficiency, so we needed both to find a way to eliminate this error and to 
run the execution more efficiently at the same time. 

4.2.2 Initial Approach: Ad Hoc 

The first approach was ad hoc. We adjusted the type and number of EC2 instances and 
increased the number of reducers to execute the job. This was a trial and error method. 
We only executed Algorithm 3-1 for this approach, and the starting point was from a failed 
execution on four 2016 Common Crawl segments at a 95.24% progression rate (1520/1596 
tasks completed) using 20 m2.xlarge instances, with six hours and 54 minutes of execution 
time. 

Number of Reducers 

In our previous executions using m2.xlarge instances, we found that the output size was 
proportional to the input size, as shown in Table 4.8. 


Table 4.8. Output and Input Size vs. Number of Segments

Number of Segments   Single Output File Size   Number of Output Files   Total Output Size
1                    1.5 GB                    130                      195 GB
2                    3 GB                      130                      390 GB
3                    4.5 GB                    130                      585 GB


The number of output files was equal to the default number of reducers, which was closely
related to the number and type of instances we were using. Since the "out of memory"
error usually occurred in reducer tasks because of Java heap space, we guessed that varying
the number of reducers might help.
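The proportionality in Table 4.8 can be checked with a few lines of arithmetic. The following is a minimal sketch and not part of the thesis code: the class and method names are hypothetical, and the figure of 1.5 GB per output file per segment is read directly from the table. Applying it to other segment counts assumes that the linear trend continues, which three data points alone do not guarantee.

```java
// Illustrative only: reproduces the arithmetic behind Table 4.8.
final class OutputSizeEstimate {
    // Read from Table 4.8: each reducer output file grows by about 1.5 GB per segment.
    static final double GB_PER_FILE_PER_SEGMENT = 1.5;

    /** Size of a single reducer output file, in GB, for the given number of segments. */
    static double singleFileGb(int segments) {
        return GB_PER_FILE_PER_SEGMENT * segments;
    }

    /** Total output size in GB, with one output file per reducer. */
    static double totalOutputGb(int segments, int reducers) {
        return singleFileGb(segments) * reducers;
    }

    public static void main(String[] args) {
        int reducers = 130; // default number of reducers observed in the table
        for (int segments = 1; segments <= 3; segments++) {
            System.out.printf("%d segment(s): %.1f GB per file, %.0f GB total%n",
                    segments, singleFileGb(segments), totalOutputGb(segments, reducers));
        }
    }
}
```

With a fixed number of reducers, the share of data handled by each reducer grows with the input, which is consistent with the error surfacing in the reducers' heap space.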

First, we tried lowering the number of reducers from 130 to 60, but the "out of memory"
error came earlier, after three hours. We then tried increasing the number of reducers
from 130 to 200, but we got the "out of memory" error again because of Java heap space
after ten hours and five minutes at a 98.25% progression rate (1568/1596 tasks completed).
We also got another error message: "Too many fetch failures." This new error message
appeared because the reducers spent too much time swapping, which is known as a memory
paging problem. Taken together with the noticeably longer execution time, this suggested that
a larger number of reducers improved the execution progression, but that too many reducers
caused the memory paging problem and, with it, execution failure. Since the additional 3% of
progression required three more hours of execution time, the swapping evidently took a
significant amount of time and resulted in fetch failures because of timeouts. Considering that
an archive contains 100 segments, increasing the number of reducers alone does not solve
our problem.


Types of Instances 

We could always use more instances to process our work, but we wanted to use them wisely.
We knew different types of instances would be suitable for different jobs, as discussed in
Chapter 2. To determine the right type, we ran the basic algorithm five times on the same
single segment for each instance type, using 20 instances each time. The results are shown
in Table 4.9.


Table 4.9. Execution Time vs. Different Types of Instances (Same Segment)

Type         1st run   2nd run   3rd run   4th run   5th run   Avg (min)
m3.xlarge    69        66        68        60        63        65.2
r3.xlarge    62        58        61        58        59        59.6
c3.xlarge    83        73        68        68        68        72
d2.xlarge    57        57        55        57        57        56.6
i2.xlarge    61        64        60        64        66        63
g2.2xlarge   41        41        40        41        41        40.8


We could see that g2.2xlarge was the fastest of the basic instance types; however,
we had to take the price per hour (hr) of the instances into consideration. From the pricing
information and the execution time, we could also calculate the price per segment (seg),
as listed in Table 4.10.


Table 4.10. Advanced Comparison between Instance Types

Type         Avg (min)   Avg (hr)   price/hr   price/seg
m3.xlarge    65.2        1.087      0.07       0.0761
r3.xlarge    59.6        0.993      0.09       0.0894
c3.xlarge    72          1.2        0.053      0.0636
d2.xlarge    56.6        0.943      0.173      0.1632
i2.xlarge    63          1.05       0.213      0.2237
g2.2xlarge   40.8        0.68       0.2        0.136
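The price/seg column follows from the two columns before it: the average execution time converted to hours, multiplied by the hourly price. A minimal sketch of that calculation is given below; the class and method names are hypothetical, and the input numbers are taken from Tables 4.9 and 4.10.

```java
// Illustrative only: recomputes the Avg (min) and price/seg columns of Tables 4.9 and 4.10.
final class PricePerSegment {
    /** Arithmetic mean of the measured run times, in minutes. */
    static double averageMinutes(int... runMinutes) {
        if (runMinutes.length == 0) {
            throw new IllegalArgumentException("at least one run is required");
        }
        double sum = 0;
        for (int minutes : runMinutes) {
            sum += minutes;
        }
        return sum / runMinutes.length;
    }

    /** Price for one segment: execution time in hours multiplied by the hourly price. */
    static double pricePerSegment(double avgMinutes, double pricePerHour) {
        return avgMinutes / 60.0 * pricePerHour;
    }

    public static void main(String[] args) {
        // The five g2.2xlarge runs from Table 4.9 and its price/hr from Table 4.10.
        double avg = averageMinutes(41, 41, 40, 41, 41);
        System.out.printf("avg = %.1f min, price/seg = %.4f%n", avg, pricePerSegment(avg, 0.2));
    }
}
```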


We could see that although the price/hr for g2.2xlarge was relatively high, its price/seg
was quite low once the processing speed was taken into account. Two points still deserve
attention here: we had reasoned in Chapter 2 that compute-optimized instances would
outperform other types, and the price/hr for the basic m3.xlarge, r3.xlarge, and c3.xlarge
instances was much lower than for the other three basic instance types. We therefore repeated
the runs using larger instances, c3.4xlarge (four times the CPUs, memory, and disk size of
c3.xlarge) and r3.2xlarge (twice the resources of r3.xlarge), which had a price/hr similar to
that of g2.2xlarge (type m3 did not offer an instance with similar pricing). The results are
shown in Table 4.11.

Table 4.11. Advanced Comparison between Instance Types in the Same Price Range

Type         Avg (min)   Avg (hr)   price/hr   price/seg
r3.2xlarge   41.4        0.69       0.18       0.1242
c3.4xlarge   21          0.35       0.21       0.0735
d2.xlarge    56.6        0.943      0.173      0.1632
i2.xlarge    63          1.05       0.213      0.2237
g2.2xlarge   40.8        0.68       0.2        0.136


We found c3.4xlarge to be the best cost-performance instance type for our algorithm, and it
required the least amount of time, as expected. We note that these results reflect AWS EMR
pricing only. The instances (ins.) themselves are charged at AWS EC2 pricing, which should
also be considered in the actual cost. In the final cost, the difference grows even larger
as the number of instances increases, as shown in Table 4.12. Nonetheless, this
result was much better than that of the m2.xlarge instances we used previously, which took
almost two hours to finish a single segment and encountered the "out of memory" error
because of Java heap space after almost seven hours of execution
on four segments. We executed the algorithm again on four segments, using 20 c3.4xlarge
instances, and it finished in 71 minutes without error. We decided to use c3.4xlarge as the
instance type for our jobs since it had the fastest execution time and nearly the lowest
price per segment: it was about three times faster than c3.xlarge in execution time and only
15% higher in price/seg for EMR.


Table 4.12. Combined Cost Comparison between Instance Types

Type         Avg (hr)   EMR pr/hr   EC2 pr/hr   20 ins.   40 ins.   60 ins.   80 ins.   100 ins.
                                                pr/seg    pr/seg    pr/seg    pr/seg    pr/seg
r3.2xlarge   0.69       0.18        0.665       9.3       18.48     27.66     36.83     46.01
c3.4xlarge   0.35       0.21        0.84        5.95      11.83     17.71     23.59     29.47
d2.xlarge    0.943      0.173       0.69        13.18     26.19     39.20     52.22     65.23
i2.xlarge    1.05       0.213       0.853       18.14     36.05     53.96     71.88     89.79
g2.2xlarge   0.68       0.2         0.65        8.98      17.82     26.66     35.5      44.34
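The pr/seg entries in Table 4.12 are consistent with charging the EC2 hourly price for every instance and adding the EMR hourly price once, all multiplied by the average execution time in hours. This formula is inferred from the table values rather than stated in the text, so the sketch below should be read as a reconstruction under that assumption; the class and method names are hypothetical.

```java
// Illustrative only: a formula inferred from the values in Table 4.12.
final class CombinedCost {
    /**
     * Combined cost of processing one segment.
     * Assumption (inferred from the table): the EC2 rate applies per instance,
     * while the EMR rate is added once.
     */
    static double costPerSegment(double hours, int instances,
                                 double ec2PerHour, double emrPerHour) {
        return hours * (instances * ec2PerHour + emrPerHour);
    }

    public static void main(String[] args) {
        // c3.4xlarge row: 0.35 hr per segment, EMR 0.21/hr, EC2 0.84/hr.
        for (int instances = 20; instances <= 100; instances += 20) {
            System.out.printf("%3d instances: %.2f per segment%n",
                    instances, costPerSegment(0.35, instances, 0.84, 0.21));
        }
    }
}
```

Under this reading, the EC2 charge dominates the total, which is why the gap between instance types widens as the cluster grows.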


Number of Instances 

As soon as we succeeded on four segments using 20 c3.4xlarge instances, we tried the full
archive using 100 c3.4xlarge instances, and we failed. After 12 hours, we received the "out
of memory" error because of both "Java heap space" and "GC overhead limit exceeded."
Since we already knew that 20 instances could finish four segments at once, we then reduced
the number of segments to 18, a number close to 20 that is easily selected because of the
naming convention in the Common Crawl, and tried again. If the computation were truly
parallel, then 100 instances should have been more than enough for a smooth execution on 18
segments. After 82 minutes of execution, the "out of memory" error appeared again at
a 98.77% progression rate (6979/7066 tasks completed), and we realized that we were not
able to solve this error by simply increasing the number of instances.

Conclusion for Ad Hoc Approach 

The most beneficial information we learned from the ad hoc approach was that c3.4xlarge
was the best choice among the instance types for both cost-performance and speed in this
study. We also learned that slightly increasing the number of reducers can possibly
overcome the "out of memory" error if the task progression is high enough before the error
occurs. Furthermore, we discovered that increasing the number of instances was not likely
to solve the "out of memory" problem caused by "Java heap space." We realized the need
to dive deeper into the environment settings in HMR in order to achieve the goal.


4.2.3 Systematic Approach 

After the ad hoc approach, we realized that AWS does not provide auto-scaling when
we use more instances and that we had to adjust the environment variables ourselves,
especially those for memory. We then revisited the YARN architecture and tried to understand
how to adjust the environment variables. Fortunately, we discovered a rule of thumb with
which to begin [8]. We also looked into the previous error messages and compiled a list of
variables possibly related to the various types of errors. The different types of errors, the
corresponding variables that might remove the errors, and the files in which each variable
resides are listed in Table 4.13. In this approach, we adjusted each variable in order to
understand how it affects the results.


Table 4.13. Types of Error and Possible Corresponding Environmental Variables

Error Type               Corresponding Variable                          Related File
Java heap space          yarn.nodemanager.resource.memory-mb (1)         yarn-site.xml
                         yarn.scheduler.minimum-allocation-mb (2)        yarn-site.xml
                         yarn.scheduler.maximum-allocation-mb (3)        yarn-site.xml
                         mapreduce.task.io.sort.mb                       mapred-site.xml
                         mapreduce.map.memory.mb (4)                     mapred-site.xml
                         mapreduce.reduce.memory.mb (5)                  mapred-site.xml
                         mapreduce.map.java.opts (6)                     mapred-site.xml
                         mapreduce.reduce.java.opts (7)                  mapred-site.xml
                         yarn.app.mapreduce.am.resource.mb (8)           mapred-site.xml
                         yarn.app.mapreduce.am.command-opts (9)          mapred-site.xml
Too many fetch failures  mapreduce.job.reduce.slowstart.completedmaps    mapred-site.xml
                         mapreduce.shuffle.max.threads                   mapred-site.xml
                         mapreduce.reduce.shuffle.parallelcopies         mapred-site.xml
GC overhead limit        HADOOP_DATANODE_OPTS                            hadoop-env.sh
exceeded                 HADOOP_CLIENT_OPTS                              hadoop-env.sh


To change the environment variables through the AWS CLI, we created a configuration file
named "configurations.json" and referred to it when creating the cluster with the following
argument:

--configurations file://configurations.json

The configuration file had a certain format, which is described in Appendix F.


Stage 1: 18 Segments 

Because our execution with 18 segments as input had failed, we used 18 segments as the
starting point for this approach. Since the "out of memory" error was our major concern, we
focused on adjusting the memory-related parameters. We tried four different settings before
reaching a successful execution; the variable sets used are shown in Table 4.14. Because the
variable names are long, we refer to each variable by the number assigned in Table 4.13.


Table 4.14. Variable Sets and Results for 18 Segments

Item             1            2            3            4
# of instances   20           20           21           21
var(1)           20480        614400       614400       35840
var(2)           4096         8500         8500         7168
var(3)           20480        614400       614400       35480
var(4)           4096         8500         8500         7168
var(5)           8192         17000        17000        14336
var(6)           -Xmx3276m    -Xmx6800m    -Xmx6800m    -Xmx5734m
var(7)           -Xmx6553m    -Xmx14600m   -Xmx14600m   -Xmx11469m
var(8)           8192         17000        17000        14336
var(9)           -Xmx6553m    -Xmx14600m   -Xmx14600m   -Xmx11469m
execution time   11hr29min    9min         5min         11hr5min
progression      99.61%       0%           0%           100%
error code       143          137          137          -


We found a post on the Internet by someone who had also run a comparably large MR job and
gotten the "out of memory" error [29]. The originator of the post finally succeeded using a
set of environment variables and shared this information freely. We used this setting with
only 20 instances to run the first execution, which took about 11.5 hours before the "out of
memory" error occurred and reached a 99.67% progression rate, almost 1 percent better than
the previous execution using 100 instances with the default setting. This showed that
adjusting memory-related parameters helped to improve the outcome of the execution.

Unfortunately, there is little information on how to set these numbers. Most guidance found
in AWS manuals and on the Internet was reasonable but of little practical use: it indicated
only that these variables were important and needed to be adjusted depending on the job
being run. The only reference we found described a general rule to calculate
the recommended memory settings depending on system capacities [8]. We followed its
instructions to set the environment variables for the second execution. It failed because the
system could not allocate the assigned amount of memory, which we attributed to a
feature of AWS EMR. As stated in Chapter 2, EMR is a PaaS, which hides many
system details from the user. In the second execution, one of our 20 instances was assigned
as a "master" instance. It ran only the ResourceManager application and no map or
reduce tasks. For the third execution, we increased the requested number of instances to 21,
so that we would have 20 instances running mappers and reducers.

The third execution also failed due to memory allocation. We realized that there were some 
inconsistencies between [8] and EMR. We did not investigate this issue at this stage, but 
we thought the unrealistic memory variables var (1) and var (3) might be the problem. We 
tried to lower these numbers and finally succeeded using the fourth setting, with about 11 
hours of execution time. 

At this stage, we established a starting point for properly setting up the AWS EMR environment
so that a much larger job than in the previous executions could run successfully. We
learned that error code 143 indicated the "out of memory" issue and that error code 137 was
related to a memory configuration error. We succeeded in executing 18 segments as input,
six times more than the three segments that were our original capability
using 20 m2.xlarge instances.
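The two codes are ordinary process exit statuses. By the common Unix shell convention, a status above 128 means that the process was ended by the signal numbered (status - 128), so 143 corresponds to SIGTERM (15) and 137 to SIGKILL (9). The sketch below decodes such a status. The class and method names are hypothetical, and the association of each code with a particular memory problem is the observation reported above, not something the exit status itself guarantees.

```java
// Illustrative only: decodes container exit statuses using the 128 + signal convention.
final class ExitCodeDecoder {
    /** Returns the signal number encoded in the exit status, or 0 if none is encoded. */
    static int signalNumber(int exitCode) {
        return exitCode > 128 ? exitCode - 128 : 0;
    }

    /** Human-readable description of the exit status. */
    static String describe(int exitCode) {
        int signal = signalNumber(exitCode);
        switch (signal) {
            case 15:
                return "SIGTERM (15)";
            case 9:
                return "SIGKILL (9)";
            case 0:
                return "no signal (plain exit status " + exitCode + ")";
            default:
                return "signal " + signal;
        }
    }

    public static void main(String[] args) {
        for (int code : new int[] {143, 137}) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```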

Stage 2: 34 Segments 

After our success with 18 segments, we used the same setting to run the full archive
and failed. We knew that we did not fully understand the memory settings, such as the
inconsistencies between [8] and EMR, and that we needed more adjustments to gain a better
understanding of them. We reduced the input size to 34 segments and
executed with 41 instances, one of which was assigned as the "master," using the
setting that had succeeded for 18 segments, and again got the "out of memory" error. We ran
further executions, which yielded the eight results captured in Table 4.15 and Table 4.16.
As in the previous stage, the variables are indicated by the numbers assigned in Table 4.13.


Table 4.15. Variable Sets and Results for 34 Segments (part 1)

Item                1            2            3            4
# of instances      41           41           41           41
var(1)              70656        35328        70656        25500
var(2)              7065         7065         7065         7065
var(3)              35328        35328        70656        25500
var(4)              7065         7065         7065         -
var(5)              14131        14131        21504        14500
var(6)              -Xmx5652m    -Xmx5652m    -Xmx5652m    -
var(7)              -Xmx11305m   -Xmx11305m   -Xmx17203m   -Xmx11500m
var(8)              14131        14131        21504        14500
var(9)              -Xmx11305m   -Xmx11305m   -Xmx17203m   -Xmx11500m
execution time      4hr24min     12hr         6hr17min     6hr33min
progression         70.34%       99.95%       97.23%       99.95%
encountered error   137          143          137          143


Table 4.16. Variable Sets and Results for 34 Segments (part 2)

Item                5            6            7            8
# of instances      41           41           45           41
var(1)              25500        25500        25500        23000
var(2)              250          250          250          250
var(3)              25500        25500        25500        23000
var(5)              22500        14000        14131        16500
var(7)              -Xmx18000m   -Xmx11000m   -Xmx11305m   -Xmx13500m
var(8)              22500        14000        14131        16500
var(9)              -Xmx18000m   -Xmx11000m   -Xmx11305m   -Xmx13500m
execution time      7hr56min     6hr31min     6hr7min      6hr23min
progression         70.34%       97.23%       99.95%       100%
encountered error   143          143          143          -


We started by determining the proper setting for variables (1) and (3), using the different
settings in sets 1 through 3. We also compared the relations among the components of YARN,
the results from AWS EMR, and the reasons why the variables were set as in [8].

After a series of executions, it became clear that the settings in [8] are for a cluster of a
single instance. This explains why variables (1) and (3) add up to the total memory
in a single instance in [8]. In EMR, every core instance was a single computer running
both the DataNode and NodeManager daemons. Since we used c3.4xlarge instances as
core instances with the default settings given in [30], [31], the maximum total
memory was around 27 GB per instance, and this should be the value for variables (1) and
(3). We also assumed that the size of each container was determined by variables (2) and (3):
it started at the value of variable (2) and was increased upon request in multiples of
variable (2) until it reached the value of variable (3).
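The container sizing assumed above can be written down as a short calculation. The following is a minimal sketch of that assumption only, not of the actual YARN scheduler code: the class and method names are hypothetical, and a request above the maximum is simply capped here, whereas a real scheduler may treat such a request differently.

```java
// Illustrative only: models the container sizing assumed in the text.
final class ContainerSizing {
    /**
     * Rounds a memory request up to the next multiple of the minimum allocation,
     * var(2), and caps the result at the maximum allocation, var(3).
     */
    static int containerMb(int requestedMb, int minAllocMb, int maxAllocMb) {
        int multiples = (requestedMb + minAllocMb - 1) / minAllocMb; // ceiling division
        int rounded = Math.max(1, multiples) * minAllocMb;
        return Math.min(rounded, maxAllocMb);
    }

    /** Number of containers of the given size that fit in one node's memory, var(1). */
    static int containersPerNode(int nodeMemoryMb, int containerMb) {
        return nodeMemoryMb / containerMb;
    }

    public static void main(String[] args) {
        // Set 8 of Table 4.16: var(1) = var(3) = 23000, var(2) = 250, var(5) = 16500.
        int reducer = containerMb(16500, 250, 23000);
        System.out.println("reducer container: " + reducer + " MB, per node: "
                + containersPerNode(23000, reducer));
    }
}
```

Under this model, a large reducer container leaves room for at most one or two reducers per node, and whatever memory remains bounds the number of mappers that can run alongside them.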

Reference [8] indicates that the reducer memory allocation, var(5), should be twice the
mapper memory allocation, var(4). Furthermore, var(2) is the minimum container memory
allocation and should be the greatest common divisor of var(4) and var(5). We also found
that we did not have to set a larger value for variable (4) because the default value for
the mapper is always larger than needed, and we never encountered errors in mapper tasks.
In fact, we removed variables (4) and (6) and achieved a much faster execution time,
saving more than five hours for the same progression rate. This was because many more
mappers could run at the same time. Tracking the causes of errors became
easier once we knew what to adjust for a given error code.
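The relationships described above can be collected into a small helper for deriving a consistent variable set from a chosen mapper container size. This is a sketch under stated assumptions: the class and method names are hypothetical, and the 80% heap-to-container ratio is read off Tables 4.14 through 4.17, where most (but not all) of the -Xmx values are about 0.8 times the corresponding container size; it is not a value quoted from [8] in the text.

```java
// Illustrative only: derives a memory variable set from the rules described in the text.
final class MemoryRuleOfThumb {
    /** Greatest common divisor by Euclid's algorithm. */
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    /** Reducer container, var(5): twice the mapper container, var(4). */
    static int reducerMb(int mapperMb) {
        return 2 * mapperMb;
    }

    /** Minimum allocation, var(2): greatest common divisor of var(4) and var(5). */
    static int minimumAllocationMb(int mapperMb, int reducerMb) {
        return gcd(mapperMb, reducerMb);
    }

    /** JVM heap option at 80% of the container size (assumed ratio), rounded to the nearest MB. */
    static String heapOption(int containerMb) {
        long heapMb = (containerMb * 4L + 2) / 5;
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        int mapper = 7168; // var(4) of the successful 18-segment set
        int reducer = reducerMb(mapper);
        System.out.println("var(2) = " + minimumAllocationMb(mapper, reducer));
        System.out.println("var(5) = " + reducer + ", var(7) = " + heapOption(reducer));
        System.out.println("var(6) = " + heapOption(mapper));
    }
}
```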

Because overriding the mapper memory parameters increased the execution time, we left them
at their defaults in sets 5 through 7. We slightly adjusted the reducer memory in sets 5 and
6 using set 4 as a template and tried increasing the number of instances in set 7, but all
executions for variable sets 5, 6, and 7 failed.

We noted that the progression for set 5 was surprisingly low, so we looked into the error
message in each failing task. The general error message output from the cluster console did
not give any useful information, so we had to track down the error to each assigned task and
each container. We found that the tasks had not failed because of the "out of memory" error
reported in the cluster console. Instead, they failed because the reducer tasks had issues
connecting to S3 when emitting the result, a rare event that occurs when many users are
accessing S3 at the same time. We decided to execute with a larger reducer memory setting,
as listed in set 8; the amount was chosen by observing the trends of progression and reducer
memory. We finally succeeded after about 6.5 hours.

At this stage, we learned the meaning of most of these memory-related variables and the
roles they played in the YARN model. We also learned how to track down the source of an
error to find the actual reason for the problem, instead of blindly believing the given error
message. Hints for tracking an error message are shown in Appendix G. We also realized
that as a PaaS, AWS EMR had some downsides. EMR depended on S3 to save the output
and did not keep the cluster once the job was terminated. This meant that our effort was lost
whenever the connection between EMR and S3 failed.


Stage 3: Full Archive 

From the previous stages, we thought we had learned enough about YARN, and we started the
execution using one full 2016 Common Crawl archive (CC-MAIN-2016-07) as the input on
100 c3.4xlarge instances. The results are shown in Table 4.17.


Table 4.17. Variable Sets and Results for the Full Archive

Item                                   1            2            3
# of instances                         100          100          100
yarn.nodemanager.resource.memory-mb    27468        27468        27468
yarn.scheduler.minimum-allocation-mb   1024         160          32
yarn.scheduler.maximum-allocation-mb   27468        27468        27468
mapreduce.map.memory.mb                1024         -            -
mapreduce.reduce.memory.mb             20480        20480        22528
mapreduce.map.java.opts                -Xmx864m     -            -
mapreduce.reduce.java.opts             -Xmx16384m   -Xmx16384m   -Xmx18022m
yarn.app.mapreduce.am.resource.mb      20480        20480        22528
yarn.app.mapreduce.am.command-opts     -Xmx16384m   -Xmx16384m   -Xmx18022m
execution time                         7hr8min      7hr37min     7hr51min
progression                            99.99%       99.99%       99.99%
encountered error                      143          143          143


From previous executions, we thought the mapper would suffice with just 1 GB of memory,
which was the default for m2.xlarge. We set the mapper memory lower than the default
setting to obtain a faster execution, because the earlier the job finished, the lower the
price would be. We increased the reducer memory because the ratio between the number of
instances and the number of segments had decreased to one. The resulting set 1 failed at a
99.99% progression rate because the 1-GB memory limit caused two of the 35684 assigned
tasks to fail, with one more task cancelled because of the failed tasks. We fixed the problem by
removing the map memory settings, as in variable set 2, but we again encountered the
"out of memory" error from the reducers. We increased the reducer memory settings, as
in variable set 3. We still had the "out of memory" error, which indicated that we had still
not allocated enough memory. We knew that we could succeed at the price of a longer
execution time if we assigned more memory to the reducers, because the number of mappers
running at the same time would be reduced. Because administrative issues prevented us from
running more executions on AWS and the 99.99% complete MR job had generated usable
output 18.3 TB in size, we did not continue with further executions.

Conclusion for Systematic Approach 

We learned a lot with the systematic approach. We now clearly understood the YARN
architecture and how it was affected by the environment variables. We learned how to track
down the error message to the source of the problem in either a mapper/reducer task or a
container. We also learned the limits of EMR as a PaaS. Most importantly, we essentially
achieved our original goal: to complete a full archive in a single execution.

CHAPTER 5:

Conclusions and Recommendations 


Two goals of the study described in this thesis support the larger NPS research project: the
first goal is to implement RWLG, a search capability for web sites, and the second goal
is to take a full Common Crawl archive as the input to be processed in one MR job. We
essentially achieved both goals in this study: for the second goal, we had an almost
successful execution that generated 99.9% of the usable result, and we know exactly
how to adjust the settings to complete the execution. There are, however, still avenues
to be explored, such as implementing other algorithms or making use of the output from
this work. We summarize our contributions and provide recommendations for future work
in this chapter.

5.1 Conclusions 

The implementation of RWLG was accomplished successfully, and the algorithm was
almost successfully executed on a full 2016 Common Crawl archive. We also extended the
RWLG algorithm to generate link pairs that are more than one hop away,
with both analytical proof and experimental validation provided. Using [2] as a building
block, we shared most of the same tools in this study. In addition to providing the powerful
RWLG as a search capability, we discussed MR programs that can categorize, sort, and
compare the stored value pairs to make use of the resulting output. By working with
low-level, relatively stable CLI commands in this study, we were able to run these programs
on AWS independently of its GUI, which may change from time to time.

The existing YARN has been successfully leveraged to better support the extremely large
data sets processed for this thesis research. Through the systematic approach, we achieved
the goal of taking a full Common Crawl archive as input to be processed in a single MR
job by identifying and setting suitable environment variables for the HMR components.
Because it shows how to track an error down to its original cause, this thesis can serve
as a guide for adjusting the corresponding variables to achieve a faster execution
time while successfully finishing different MR jobs.

5.2 Recommendations

Although we essentially achieved both goals, there are still many possibilities for further 
research. The possible areas to explore are listed below. 


1. Design and implement new search capabilities 

Previous work provided the inverted index, and this work provides RWLG. There 
are still different search capabilities to be designed and implemented to help find the 
desired information efficiently. Other than designing a new search capability, another 
path for future work can combine the results of the inverted index and RWLG to offer 
more precise information for any preferred application. 

2. Move to IaaS instead of PaaS 

As stated in Chapter 4, there are downsides to PaaS. Not only did we lose data once
an error occurred, but low-level system details are also hidden from the user. If
we switch to EC2 and use a set of instances that are fully under our control, we would
be able to eliminate this problem. Further, under EMR we are only allowed to use the same
type of instance for deploying mappers and reducers. This is no longer true if we switch
to EC2, in which we can freely choose different types of instances that fit our needs
and there are more kinds of instances from which to choose. With IaaS, however, we
would assume primary responsibility for configuring the operating system and Hadoop,
along with the associated labor.

3. Make use of the result of this study 

We now have the RWLG output of a full archive and want to make use of it. One way 
to use these value pairs is to write a search program that takes an operator’s input and 
uses the RWLG data produced in this work to return a list of source URLs that link to 
the input URL(s) in either text or graphic form. The program can also take a number 
n as a parameter to produce the 1-to-n-hop link graph centered on the input URL. 

4. Process, analyze and compare the output of this study 

We are now capable of processing a full archive at a time, so we can analyze the
different outputs produced from different archives. We can observe the numbers of inward
and outward links of some popular websites and compare our results with those of other
services such as Alexa or the Wayback Machine. Alexa, which can be found at
http://www.alexa.com/, is a company owned by amazon.com that keeps track of traffic,
statistics, and analyses of different websites. The Wayback Machine, which can be found
at http://archive.org/web/, is a service from the Internet Archive that has saved different
versions of web pages for 20 years [32]. Although these websites do not generate
RWLGs as was done in this thesis, we can still determine whether their results
are related to ours by observing the deviation in links from and to different well-known
websites over the past decades.
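The search program suggested in recommendation 3 could be structured as a breadth-first search over the reverse link data. The sketch below is illustrative only and was not part of this work: it assumes that the relevant RWLG records have already been loaded into an in-memory map from a target URL to the list of source URLs that link to it, which would not be practical for the full output without an index or the categorized files. The class name, method name, and example URLs are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: n-hop lookup over reverse web-link data held in memory.
final class ReverseLinkSearch {
    /**
     * Returns every URL that reaches the target within maxHops links,
     * mapped to its hop distance from the target.
     */
    static Map<String, Integer> sourcesWithin(Map<String, List<String>> reverseLinks,
                                              String target, int maxHops) {
        Map<String, Integer> distance = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(target, 0);
        queue.add(target);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int hops = distance.get(current);
            if (hops == maxHops) {
                continue; // do not expand beyond the requested number of hops
            }
            for (String source : reverseLinks.getOrDefault(current, Collections.emptyList())) {
                if (!distance.containsKey(source)) { // visit each URL once, at its shortest distance
                    distance.put(source, hops + 1);
                    queue.add(source);
                }
            }
        }
        distance.remove(target); // report only the sources, not the target itself
        return distance;
    }

    public static void main(String[] args) {
        // Hypothetical data: a.example links to b.example, which links to c.example.
        Map<String, List<String>> reverseLinks = new HashMap<>();
        reverseLinks.put("http://c.example/", Arrays.asList("http://b.example/"));
        reverseLinks.put("http://b.example/", Arrays.asList("http://a.example/"));
        System.out.println(sourcesWithin(reverseLinks, "http://c.example/", 2));
    }
}
```

The returned hop distances are enough to render either a text listing or a 1-to-n-hop link graph centered on the input URL.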

APPENDIX A:
Hadoop Configuration Files 


The suggested source listings for the seven files that are required for Hadoop configuration
are provided in this appendix. These seven files should be under
/usr/local/Cellar/hadoop/[Version #]/libexec/etc/hadoop if they exist. Create one if needed.

A.1 hadoop-env.sh

Append the following three lines to the end of the file:

export JAVA_HOME="$(/usr/libexec/java_home)"
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"


A.2 mapred-env.sh 

Append the following line to the end of the file: 
export JAVA_HOME="$(/usr/libexec/java_home)"

A.3 yarn-env.sh 

Append the following line to the end of the file: 

YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

A.4 core-site.xml 

Rewrite the file content as the following code:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>Temporary base directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>


A.5 hdfs-site.xml 

Rewrite the file content as the following code:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


A.6 mapred-site.xml 

Rewrite the file content as the following code:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

A.7 yarn-site.xml

Rewrite the file content as the following code:

<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>$resourcemanager.full.hostname:8025</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>$resourcemanager.full.hostname:8030</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>$resourcemanager.full.hostname:8050</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>$resourcemanager.full.hostname:8041</value>
    <description>Enter your ResourceManager hostname.</description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn</value>
    <description>Comma separated list of paths. Use the list of directories from
      $YARN_LOCAL_DIR. For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/var/log/hadoop/yarn</value>
    <description>Use the list of directories from $YARN_LOG_DIR.
      For example, /var/log/hadoop/yarn.</description>
  </property>
</configuration>

APPENDIX B:

Basic Algorithm Source Listings 


Two sections are contained in this appendix. The first provides the basic algorithm
source listings described in Chapter 3, and the second gives the MR job source listings used
to categorize the output value pairs. Normally, the output is ordered by key in alphabetical
order across all output files, so when we want to look for a specific record, we have no
idea which file holds it. The "categorizer" MR job first extracts the
target’s domain name and then uses the Java built-in hashCode() function to get the hash
code corresponding to the domain name. The reducer gathers the value pairs by their
hash codes and writes them into corresponding files, in which we can easily
find a record by providing the target’s domain name.
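The idea behind the categorizer can be illustrated without Hadoop. The sketch below is not the categorizer listing itself: it only shows how a domain name might be extracted from a URL and mapped to a stable file index through hashCode(). The class name, the method names, the use of java.net.URI, the lower-casing of the host, and the number of buckets are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Illustrative only: maps a target URL to a stable bucket (output file) index.
final class DomainBucket {
    /** Extracts the host part of a URL in lower case, or returns null if it cannot be parsed. */
    static String domainOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Bucket index in [0, buckets) derived from the domain's hashCode(). */
    static int bucketOf(String domain, int buckets) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(domain.hashCode(), buckets);
    }

    public static void main(String[] args) {
        String domain = domainOf("http://www.example.com/index.html");
        System.out.println(domain + " -> bucket " + bucketOf(domain, 130));
    }
}
```

Because String.hashCode() is deterministic, a lookup tool can recompute the same index from the target's domain name and open only the matching file.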


B.1 Basic Algorithm Source Listings

The MR code always has three parts: driver, mapper and reducer. Coudray’s thesis grouped
them into two separate files, but I prefer to put each of them into a different file. These
three files are listed in the following sections.

B.1.1 Driver (RWLGHTMLDriver.java) 

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import nl.surfsara.warcutils.*;

public class RWLGHTMLDriver extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.exit(ToolRunner.run(conf, new RWLGHTMLDriver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "RWLGHTMLDriver");

    // setup all classes
    job.setJarByClass(RWLGHTMLDriver.class);
    if (arg0.length > 2 && arg0[2].equals("no-dynamic")) {
      job.setMapperClass(RWLGHTMLMapperND.class);
    } else {
      job.setMapperClass(RWLGHTMLMapper.class);
    }
    job.setReducerClass(RWLGHTMLReducer.class);

    // input/output setup
    job.setInputFormatClass(WarcInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // path setup
    String inputPath = arg0[0];
    String outputPath = arg0[1];
    FileInputFormat.addInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

B.1.2 Mapper (RWLGHTMLMapper.java) 

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jwat.warc.WarcRecord;
import org.jwat.common.Payload;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RWLGHTMLMapper extends Mapper<LongWritable, WarcRecord, Text, Text> {

  // Read from WARC file
  // output value pairs <target, source1>, <target, source2>, ...
  // # Step 2, Algorithm 3-1
  @Override
  public void map(LongWritable key, WarcRecord value, Context context)
      throws IOException, InterruptedException {
    // only check the http response
    if (value.header.warcTypeStr.equals("response")) {
      Payload payload = value.getPayload();
      if (payload != null) {
        // get the HTML page into contStream => warcContent
        InputStream inStream = payload.getInputStream();
        ByteArrayOutputStream contStream = new ByteArrayOutputStream();

        // a trick to parse the warc record into String
        byte[] buf = new byte[1024];
        int len;
        while ((len = inStream.read(buf)) != -1)
          contStream.write(buf, 0, len);
        String warcContent = contStream.toString("UTF-8").trim();

        if (warcContent != null && !warcContent.isEmpty()) {
          // parse the HTML into objects
          Document doc = Jsoup.parse(warcContent, value.header.warcTargetUriStr);
          Text source = new Text(value.header.warcTargetUriStr.trim());

          // check all links and combine with the URL of current page
          Set<Text> linkSet = new HashSet<Text>();
          for (Element e : doc.select("a[href]")) {
            // clear out characters that should not appear in the URL field
            String href = e.attr("abs:href").replaceAll("[\\t\\n\\r\\s]*", "");
            if (!href.isEmpty()) {
              Text target = new Text(href);
              // outputs the non-repeated value pairs
              if (linkSet.add(target)) context.write(target, source);
            }
          }
        }
      }
    }
  }
}

B.1.3 Mapper for Non-Dynamic URLs (RWLGHTMLMapperND.java)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jwat.warc.WarcRecord;
import org.jwat.common.Payload;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// This mapper discards the dynamic part of URLs
public class RWLGHTMLMapperND extends Mapper<LongWritable, WarcRecord, Text, Text> {

  // Read from WARC file
  // output value pairs <target, source1>, <target, source2>, ...
  // # Step 2, Algorithm 3-1
  @Override
  public void map(LongWritable key, WarcRecord value, Context context)
      throws IOException, InterruptedException {
    if (value.header.warcTypeStr.equals("response")) {
      Payload payload = value.getPayload();
      if (payload != null) {
        InputStream inStream = payload.getInputStream();
        ByteArrayOutputStream contStream = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int len;
        while ((len = inStream.read(buf)) != -1)
          contStream.write(buf, 0, len);
        String warcContent = contStream.toString("UTF-8").trim();

        if (warcContent != null && !warcContent.isEmpty()) {
          String s = value.header.warcTargetUriStr.trim();
          Document doc = Jsoup.parse(warcContent, s);

          // remove dynamic part in source
          if (s.contains("?")) {
            String parts[] = s.split("\\?");
            s = parts[0];
          }
          Text source = new Text(s);

          Set<Text> linkSet = new HashSet<Text>();
          for (Element e : doc.select("a[href]")) {
            String href = e.attr("abs:href").replaceAll("[\\t\\n\\r\\s]*", "");
            if (!href.isEmpty()) {
              // remove dynamic part in target
              if (href.contains("?")) {
                String parts[] = href.split("\\?");
                href = parts[0];
              }
              Text target = new Text(href);
              if (linkSet.add(target)) context.write(target, source);
            }
          }
        }
      }
    }
  }
}

B.1.4 Reducer (RWLGHTMLReducer.java)

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RWLGHTMLReducer extends Reducer<Text, Text, Text, Text> {

  // Read value pairs <target, source1>, <target, source2>, ...
  // output value pairs <target, list(source1, source2, ...)>
  // Step 3, Algorithm 3-1
  // also serve as step 5 reducer for Algorithm 3-2
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    Set<String> sources = new HashSet<String>();

    // clean up repeated and null/empties
    while (it.hasNext()) sources.add(it.next().toString());
    // removeAll (rather than remove) drops the individual "" and null elements
    sources.removeAll(Arrays.asList("", null));
    // nothing to output if no source is left for this key
    if (sources.isEmpty()) return;

    Iterator<String> it2 = sources.iterator();
    StringBuilder strBdr = new StringBuilder();

    // append the first value
    strBdr.append(it2.next());

    // append every other "|value" if there's any
    while (it2.hasNext()) strBdr.append("|").append(it2.next());

    // outputs the value pair
    context.write(key, new Text(strBdr.toString()));
  }
}


B.2 Categorizer Source Listings

This MR code categorizes the output records into files named by the hashed value of
their keys. As before, the code is separated into three different files.

B.2.1 Driver (CategorizerDriver.java) 

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CategorizerDriver extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.exit(ToolRunner.run(conf, new CategorizerDriver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "CategorizerDriver");

    // setup all classes
    job.setJarByClass(CategorizerDriver.class);
    job.setMapperClass(CategorizerMapper.class);
    job.setReducerClass(CategorizerReducer.class);

    // input/output setup
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // path setup
    String inputPath = arg0[0];
    String outputPath = arg0[1];
    FileInputFormat.addInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

    return job.waitForCompletion(true) ? 0 : 1;
  }
}

B.2.2 Mapper (CategorizerMapper.java) 

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CategorizerMapper extends Mapper<LongWritable, Text, Text, Text> {

  // adjust the number of divisor (x) to get corresponding
  // number of files (2x) per reducer
  private int space = (int) (((long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE) / 10);

  // Read from format: target <tab> list(source)
  // to (hashCode(target domain), target <tab> list(source)), ...
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] token = value.toString().split("\t");
    try {
      context.write(
          new Text(Integer.toString(new URL(token[0].toString()).getHost().hashCode() / space)),
          value);
    } catch (MalformedURLException ex) {
      context.write(new Text(Integer.toString("MalformedURL".hashCode() / space)), value);
    }
  }
}

B.2.3 Reducer (CategorizerReducer.java) 

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategorizerReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  public void setup(Context context) throws IOException, InterruptedException {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  // input: <hashCode(target domain), <target <tab> list(sources)>>
  // output: <target, list(sources)> into files named by the hash code
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    while (it.hasNext()) {
      String[] token = it.next().toString().split("\t");
      mos.write(new Text(token[0]), new Text(token[1]), key.toString());
    }
  }

  public void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}

APPENDIX C:

Algorithm for Far Links Source Listings 


The source listings for the far links algorithm are provided in this appendix. As described
in Chapter 3, this algorithm concatenates two MR jobs, so the listings contain two
mappers, two reducers, and a driver that defines the two chained MR jobs. Since the result
of the first MR job is temporarily saved in the cluster's disk space, this program requires
a lot of disk space to execute. We suggest finding the far link pairs only for a fixed
number of pairs. If all far link pairs are required, consider writing another driver that
separates this program into two programs and executes them separately.
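
To make the join performed by the first MR job easier to follow, the sketch below reproduces it in memory on a handful of records. It is only an illustration under stated assumptions: the class name FarLinksSketch and the method twoHop are hypothetical, the data fit in memory, and plain Java maps stand in for the Hadoop shuffle. For every key, the "provider" set holds the key's own sources and the "requester" set holds the targets that list the key as a source, which mirrors the srcProvider and srcRequester sets of the first reducer in Section C.3.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class FarLinksSketch {
    // Input: target -> direct sources. Output: target -> sources two hops away.
    static Map<String, Set<String>> twoHop(Map<String, Set<String>> direct) {
        Map<String, Set<String>> providers = new TreeMap<>();
        Map<String, Set<String>> requesters = new TreeMap<>();
        // "Map" phase: each (target, source) pair is filed under both of its ends.
        for (Map.Entry<String, Set<String>> e : direct.entrySet()) {
            String target = e.getKey();
            for (String s : e.getValue()) {
                if (s.equals(target)) continue; // skip self links
                providers.computeIfAbsent(target, k -> new TreeSet<>()).add(s);
                requesters.computeIfAbsent(s, k -> new TreeSet<>()).add(target);
            }
        }
        // "Reduce" phase: every requester of a key receives all sources of that key.
        Map<String, Set<String>> far = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : providers.entrySet()) {
            Set<String> req = requesters.getOrDefault(e.getKey(), Collections.emptySet());
            for (String s : e.getValue())
                for (String t : req)
                    if (!s.equals(t)) far.computeIfAbsent(t, k -> new TreeSet<>()).add(s);
        }
        return far;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> direct = new HashMap<>();
        direct.put("a", new HashSet<>(Arrays.asList("b", "zz", "zy")));
        direct.put("b", new HashSet<>(Arrays.asList("c")));
        direct.put("zy", new HashSet<>(Arrays.asList("c")));
        System.out.println(twoHop(direct)); // prints {a=[c]}
    }
}
```

The example records are a subset of the test input in Appendix D; the result agrees with the "a c" record of the test result listed there.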

C.1 Driver (L2Driver.java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class L2Driver extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.exit(ToolRunner.run(conf, new L2Driver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    Configuration conf = this.getConf();
    FileSystem fs = FileSystem.get(conf);

    // job1
    Job job = Job.getInstance(conf, "L2Driver");

    // setup all classes
    job.setJarByClass(L2Driver.class);
    job.setMapperClass(L2Mapper.class);
    job.setReducerClass(L2Reducer.class);

    // input/output setup
    job.setOutputKeyClass(Text.class);

    // path setup
    String inputPath = arg0[0];
    String outputPath = arg0[1];
    String tempPath = "tempComp";
    if (fs.exists(new Path(tempPath)))
      fs.delete(new Path(tempPath), true);

    FileInputFormat.addInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(tempPath));
    job.waitForCompletion(true);

    // job2
    Job job2 = Job.getInstance(conf, "L2Compact");

    // setup all classes
    job2.setJarByClass(L2Driver.class);
    job2.setMapperClass(L2CompactMapper.class);
    job2.setReducerClass(RWLGHTMLReducer.class);
    job2.setMapOutputKeyClass(Text.class);

    FileInputFormat.addInputPath(job2, new Path(tempPath));
    FileOutputFormat.setOutputPath(job2, new Path(outputPath));

    return job2.waitForCompletion(true) ? 0 : 1;
  }
}

C.2 First Mapper (L2Mapper.java) 

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class L2Mapper extends Mapper<LongWritable, Text, Text, Text> {

  // input: target <tab> source1|source2|source3 ...
  // output: (target, (target, source1)), (source1, (target, source1)), ...
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] token = value.toString().split("\t");
    String[] src = token[1].split("\\|");

    // Step 1, Algorithm 3-2
    for (String s : src) {
      // Step 2, Algorithm 3-2: skip self links
      if (!s.equals(token[0])) {
        // srcRequester
        context.write(new Text(s), new Text(token[0] + "\t" + s));
        // srcProvider
        context.write(new Text(token[0]), new Text(token[0] + "\t" + s));
      }
    }
  }
}

C.3 First Reducer (L2Reducer.java)

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// input: srcRequester & srcProvider
// output: unordered <target, source> pairs that are 2-hops away
// Step 3, Algorithm 3-2
public class L2Reducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    Set<String> srcProvider = new HashSet<String>();
    Set<String> srcRequester = new HashSet<String>();
    while (it.hasNext()) {
      String[] tarsrc = it.next().toString().split("\t");
      // check if srcProvider
      if (tarsrc[0].equals(key.toString()))
        // should add these src to local list: add into list of sources
        srcProvider.add(tarsrc[1]);
      else
        // these are the ones request the list
        srcRequester.add(tarsrc[0]);
    }

    // Step 4, Algorithm 3-2
    // output the unordered results
    for (String s : srcProvider)
      for (String t : srcRequester)
        if (!s.equals(t)) context.write(new Text(t), new Text(s));
  }
}

C.4 Second Mapper (L2CompactMapper.java) 

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class L2CompactMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Read from format: target <tab> source
  // to (target, source), ...
  // Step 6 mapper, Algorithm 3-2
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] token = value.toString().split("\t");
    context.write(new Text(token[0]), new Text(token[1]));
  }
}


C.5 Second Reducer (RWLGHTMLReducer.java) 

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RWLGHTMLReducer extends Reducer<Text, Text, Text, Text> {

  // Read value pairs <target, source1>, <target, source2>, ...
  // output value pairs <target, list(source1, source2, ...)>
  // Step 3, Algorithm 3-1
  // also serve as step 5 reducer for Algorithm 3-2
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    Set<String> sources = new HashSet<String>();

    // clean up repeated and null/empties
    while (it.hasNext()) sources.add(it.next().toString());
    // removeAll (rather than remove) drops the individual "" and null elements
    sources.removeAll(Arrays.asList("", null));
    // nothing to output if no source is left for this key
    if (sources.isEmpty()) return;

    Iterator<String> it2 = sources.iterator();
    StringBuilder strBdr = new StringBuilder();

    // append the first value
    strBdr.append(it2.next());

    // append every other "|value" if there's any
    while (it2.hasNext()) strBdr.append("|").append(it2.next());

    // outputs the value pair
    context.write(key, new Text(strBdr.toString()));
  }
}



APPENDIX D: 

Test and Result Files Listings 


The input file used in Section 3.3.3 and the resulting output are listed in this appendix.
Note that a value pair <key, value> in the input/output files for HMR is represented as
"key <tab> value," and each pair occupies one line. We also represent the value of
list<sourceURL> as "URL1|URL2|URL3|...". The URLs are concatenated with the "|"
character if more than one URL is in the list.
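
The record layout just described can be illustrated with a short Java sketch that writes and reads one line. The class name RecordFormat and its two methods are hypothetical helpers introduced only for this illustration; the MR listings in the other appendices do the same work inline with split("\t") and split("\\|").

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class RecordFormat {
    // Builds "key<tab>URL1|URL2|..." from a key and its list of source URLs.
    static String format(String key, List<String> urls) {
        return key + "\t" + String.join("|", urls);
    }

    // Splits one line back into the key and the list of source URLs.
    static Map.Entry<String, List<String>> parse(String line) {
        String[] token = line.split("\t", 2);
        List<String> urls = (token.length < 2 || token[1].isEmpty())
            ? new ArrayList<String>()
            : Arrays.asList(token[1].split("\\|"));
        return new AbstractMap.SimpleEntry<>(token[0], urls);
    }

    public static void main(String[] args) {
        String line = format("v", Arrays.asList("vv", "x", "vu"));
        System.out.println(line);
        Map.Entry<String, List<String>> rec = parse(line);
        System.out.println(rec.getKey() + " has " + rec.getValue().size() + " sources");
    }
}
```

Note that the "|" character has to be escaped as "\\|" when it is used in split(), because split() takes a regular expression.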


D.1 Test Input

a b
b c
c d
d e
e f
f g
g h
h i
i j
j k
k l
l m
m n
n o
o p
p q
q r
r s
s t
t u
u v
v w
w x
x y
y z
z a
aa b
bb c
cc d
dd e
ee f
ff g
gg h
hh i
ii j
jj k
kk l
ab b
bc c
cd d
de e
ef f
ab bb
bc cc
cd dd
de ee
ef ff
a zz
z yy
y xx
x ww
w vv
v uu
u tt
t ss
s rr
r qq
a zy
z yx
y xw
x wv
w vu
zy c
yx b
yy b
xw a


D.2 Test Result

a c
aa c
ab c
b d
bb d
bc d
c e
cc e
cd e
d f
dd f
de f
e g
ee g
ef g
f h
ff h
g i
gg i
h j
hh j
i k
ii k
j l
jj l
k m
kk m
l n
m o
n p
o q
p r
q qq|s
r rr|t
s ss|u
t tt|v
u uu|w
v vv|x|vu
w ww|y|wv
x xx|z|xw
xw zz|b|zy
y yy|a|yx
yx c
yy c
z zz|b|zy
zy d

APPENDIX E:
Sorting MR Source Listings 


The source listings for the sorting MR job are contained in this appendix. The mapper splits
the list of sources into separate value pairs, and the reducer sorts all the sources of a
target in alphabetical order.
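
The reducer's work on a single key can be sketched without Hadoop as follows. This is an illustration only: the class name SortSourcesSketch and the method sortedSources are hypothetical, and a plain Iterable of strings stands in for the values that Hadoop would pass to reduce().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SortSourcesSketch {
    // Deduplicates the sources of one target, drops empty entries,
    // sorts them alphabetically, and joins them with "|".
    static String sortedSources(Iterable<String> values) {
        Set<String> sources = new HashSet<>();
        for (String v : values) sources.add(v);
        // removeAll is needed here; remove(list) would look for the list itself
        sources.removeAll(Arrays.asList("", null));
        List<String> ordered = new ArrayList<>(sources);
        Collections.sort(ordered);
        return String.join("|", ordered);
    }

    public static void main(String[] args) {
        // prints vu|vv|x
        System.out.println(sortedSources(Arrays.asList("x", "vv", "", "vu", "x")));
    }
}
```

Collections.sort() orders strings by their natural (lexicographic) ordering, which is what "alphabetical order" amounts to for lowercase URLs; mixed-case input would sort uppercase letters first.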


E.1 Driver (SortDriver.java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SortDriver extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.exit(ToolRunner.run(conf, new SortDriver(), args));
  }

  @Override
  public int run(String[] arg0) throws Exception {
    // job1
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "SortDriver");

    // setup all classes
    job.setJarByClass(SortDriver.class);
    job.setMapperClass(SortMapper.class);
    job.setReducerClass(SortReducer.class);

    // input/output setup
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // path setup
    String inputPath = arg0[0];
    String outputPath = arg0[1];
    FileInputFormat.addInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

    return job.waitForCompletion(true) ? 0 : 1;
  }
}


E.2 Mapper (SortMapper.java) 

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Read from format: target <tab> source1|source2|source3 ...
  // to (target, source1), (target, source2), ...
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] token = value.toString().split("\t");
    String[] source = token[1].toString().split("\\|");
    for (String s : source)
      context.write(new Text(token[0]), new Text(s));
  }
}

E.3 Reducer (SortReducer.java) 

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<Text, Text, Text, Text> {

  // Read value pairs <target, source1>, <target, source2>, ...
  // output value pairs <target, list(source)> where the list is
  // in alphabetical order
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> it = values.iterator();
    Set<String> sources = new HashSet<String>();

    // clean up repeated and null/empties
    while (it.hasNext()) sources.add(it.next().toString());
    // removeAll (rather than remove) drops the individual "" and null elements
    sources.removeAll(Arrays.asList("", null));
    // nothing to output if no source is left for this key
    if (sources.isEmpty()) return;

    List<String> orderedSrc = new ArrayList<String>(sources);
    Collections.sort(orderedSrc); // sort the list here

    Iterator<String> it2 = orderedSrc.iterator();
    StringBuilder strBdr = new StringBuilder();

    // append the first value
    strBdr.append(it2.next());

    // append every other "|value" if there's any
    while (it2.hasNext()) strBdr.append("|").append(it2.next());

    // outputs the value pair
    context.write(key, new Text(strBdr.toString()));
  }
}

APPENDIX F:

JavaScript Object Notation (JSON) File Format 


JSON [33] is a lightweight file format that is easy for humans to read and write and is
used in AWS. We used JSON files to save the environment variables in this study. As stated
in Chapter 3 and Appendix A, there are seven files for Hadoop configuration, and we can
rewrite part of each configuration file on EMR through a single JSON configuration file. A
template of the JSON file for Hadoop configuration on EMR is provided here:

[
  {
    "Classification": "[XML File]",
    "Properties": {
      "[Variable Name1]": "[Value1]",
      "[Variable Name2]": "[Value2]"
    }
  },
  {
    "Classification": "[SH File]",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "[Variable Name3]": "[Value3]"
        }
      }
    ]
  }
]

The JSON file always starts and ends with a pair of square brackets, and the configuration
for each file is enclosed in a pair of curly brackets, separated by commas. There are two
kinds of files among those seven configuration files: XML and SH files. Each type of file
must be used as shown in the template; only values in square brackets should be changed to
the desired content. Misplaced commas are a common mistake. Make sure that commas are
contained within the same brackets and that the last item in brackets is not followed by a
comma.
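
One way to avoid misplaced commas is to generate the file instead of typing it. The sketch below is a minimal illustration, not the method used in this study: the class EmrConfigJson and its methods are hypothetical, the classification names "yarn-site" and "hadoop-env" and the variable EXAMPLE_VAR are example values, and keys and values are assumed to contain no quotes or backslashes (no escaping is performed). String.join() places commas only between items, so no trailing comma can appear.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EmrConfigJson {
    // Renders {"k1": "v1", "k2": "v2"}; an empty map renders as {}.
    static String properties(Map<String, String> props) {
        List<String> items = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet())
            items.add("\"" + e.getKey() + "\": \"" + e.getValue() + "\"");
        return "{" + String.join(", ", items) + "}";
    }

    // Entry for an XML configuration file.
    static String xmlEntry(String classification, Map<String, String> props) {
        return "{\"Classification\": \"" + classification
            + "\", \"Properties\": " + properties(props) + "}";
    }

    // Entry for an SH configuration file: variables go into a nested "export" entry.
    static String shEntry(String classification, Map<String, String> exports) {
        return "{\"Classification\": \"" + classification
            + "\", \"Properties\": {}, \"Configurations\": ["
            + xmlEntry("export", exports) + "]}";
    }

    // The whole file: all entries inside one pair of square brackets.
    static String document(List<String> entries) {
        return "[" + String.join(", ", entries) + "]";
    }

    public static void main(String[] args) {
        Map<String, String> yarn = new LinkedHashMap<>();
        yarn.put("yarn.nodemanager.log-dirs", "/var/log/hadoop/yarn");
        Map<String, String> env = new LinkedHashMap<>();
        env.put("EXAMPLE_VAR", "value"); // hypothetical variable name
        System.out.println(document(Arrays.asList(
            xmlEntry("yarn-site", yarn), shEntry("hadoop-env", env))));
    }
}
```

The output is a single line; it can be passed through any JSON pretty-printer if the indented layout of the template is preferred.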
APPENDIX G:

AWS EMR Error Message Track Down 


In AWS, an error message can be found at three different levels: cluster level, task
level, and container level. This appendix uses an example to illustrate where these error
messages are located and how to track down the container-level error message.


G.1 Cluster-Level Error Message

The cluster-level error message is the most obvious one. It can be found in the AWS GUI
as shown below:


1. First Method

The first method to find a cluster-level error message is shown in Figure G.1 and
Figure G.2.

[Figure G.1: screenshot of the EMR cluster list. Annotations: "1. Expand the desired
cluster information" (cluster 1arc_c34xlarge, status "Terminated with errors, Step
failure") and "2. Click on the failed step" (step 1arc, status Failed).]

Figure G.1. First Method for Finding the Cluster-Level Error Message (1)

Figure G.2. First Method for Finding the Cluster-Level Error Message (2)

2. Second Method 

The second method to find a cluster-level error message is shown in Figure G.3 and 
Figure G.4. 


[Figure G.3: screenshot of the EMR cluster list. Annotation: "1. Click on the failed
cluster name" (cluster 1arc_c34xlarge, status "Terminated with errors").]

Figure G.3. Second Method for Finding the Cluster-Level Error Message (1)

[Figure G.4: screenshot of the cluster details page. Annotations: "2. Expand the Steps
information" and "3. These contain the cluster-level error message" (the controller,
syslog, stderr, and stdout log files of the failed step 1arc, ID S-2AJ3UHTU8CTNU).]

Figure G.4. Second Method for Finding the Cluster-Level Error Message (2)

3. Third Method

The third method to find a cluster-level error message is shown in Figure G.5,
Figure G.6, and Figure G.7.


Create cluster 


View details 


Clone 


Terminate 


Filter All clusters 

Name 


I Filter clusters.. 


t- • 1arc_c34xlarge 

Summary _ 

Master ec2-52-90-226-254.compute- 
public DNS: 1 .amazonaws.com 
Termination 
protection: Off 
Tags: ~ 


ID 


100 clusters (all loaded) 

Status 


Creation time (UTC-8) w 


1. Record the failed cluster ID 

| i-6QEKXDOOGIAW | 


Name 

larc 


Terminated with errors 

Step failure 


Status 

Failed 


2017-02-20 16:06 (UTC-8) 


Elapsed time 


7 hours, 37 nr 


Start time (UTC-8) ^ 

2017-02-20 16:12 (UTC-8) 


Setup Hadoop Debugging Completed 2017-02-20 16:12 (UTC-8) 


View all Interactive jobs 

Elapsed time 

7 hours, 28 minutes 
2 seconds 


Figure G.5. Third Method for Finding the Cluster-Level Error Message (1) 


► Monitoring 

► Hardware 
▼ Steps 

| Clone step Cancel step 

Steps View all interactive jobs I View all jobs 

Filter, i All steps _ fljfi/rer steps ... 2 steps (all loaded) 


ID 

Name 

Status 

Start time (UTC-8) w 

Elapsed time 

Log files 

IS 2AJ3UHTU8CTNU | 

larc 

Failed 

2017-02-20 16:12 (UTC-8) 

7 hours, 28 minutes 

controller | syslog | stderr | stdout* C 

S-PDRE92M5Z0KB 

Setup 

Hadoop 

Completed 

2017-02-20 16:12 (UTC-8) 

2 seconds 

View logs 


Debugging 

Figure G.6. Third Method for Finding the Cluster-Level Error Message (2)

[Screenshot: Amazon S3 console. Annotation 3: "Trace the log file path by the IDs recorded in the previous steps". Annotation 4: "Those contain the cluster-level error message".]

Path: Amazon S3 > aws-logs-270560560283-us-east-1 > elasticmapreduce > j-6QEKXDOOGIAW > steps > S-2AJ3UHTU8CTNU

   Name                      Last modified              Size
   controller.gz             Feb 20, 2017 11:41:59 PM   1.2 KB
   stderr.gz                 Feb 20, 2017 11:38:29 PM   609.0 B
   syslog.2017-02-21-00.gz   Feb 20, 2017 5:03:29 PM    620.0 B
   syslog.2017-02-21-01.gz   Feb 20, 2017 6:03:29 PM    309.0 B
   syslog.2017-02-21-02.gz   Feb 20, 2017 7:03:29 PM    266.0 B
   syslog.2017-02-21-03.gz   Feb 20, 2017 8:03:29 PM    282.0 B
   syslog.2017-02-21-04.gz   Feb 20, 2017 9:03:29 PM    276.0 B
   syslog.2017-02-21-05.gz   Feb 20, 2017 10:03:29 PM   494.0 B
   syslog.2017-02-21-06.gz   Feb 20, 2017 11:03:29 PM   673.0 B


Figure G.7. Third Method for Finding the Cluster-Level Error Message (3) 
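
The path traced in steps 3 and 4 follows a fixed pattern, so it can be assembled from the two recorded IDs. The following is a minimal sketch rather than part of the original study: the function names are hypothetical, and the layout (elasticmapreduce/<cluster ID>/steps/<step ID>/) is inferred only from the path shown in Figure G.7.

```python
def step_log_prefix(cluster_id: str, step_id: str,
                    base_prefix: str = "elasticmapreduce") -> str:
    """Build the S3 key prefix that holds the log files of one step.

    Assumption: logs are laid out as <base>/<cluster ID>/steps/<step ID>/,
    which is the layout visible in Figure G.7.
    """
    return f"{base_prefix}/{cluster_id}/steps/{step_id}/"


def s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key}"


# IDs recorded in steps 1 and 2, written as they appear in the figures.
prefix = step_log_prefix("j-6QEKXDOOGIAW", "S-2AJ3UHTU8CTNU")
print(s3_uri("aws-logs-270560560283-us-east-1", prefix))
```

The printed URI can then be opened in the S3 console or listed with the AWS CLI command `aws s3 ls`.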


G.2 Task-Level Error Message 

1. First Method 

Continuing from the last step of the First Method for finding the cluster-level error message, the first method for finding a task-level error message is shown in Figures G.8, G.9, and G.10.

[Screenshot: details of the failed step 1arc (ID S-2AJ3UHTU8CTNU; status Failed; started 2017-02-20 16:12 (UTC-8); elapsed time 7 hours, 28 minutes; log files controller | syslog | stderr | stdout). Annotation 1: "Click to view tasks of failed job".]

   JAR location:        s3://reversedweblinkgraph/RWLGMapReduce.jar
   Main class:          None
   Arguments:           RWLGHTMLDriver s3://commoncrawl/crawl-data/CC-MAIN-2016-07/segments/*/warc/ s3n://reversedweblinkgraph/output3/1arcfinal2/
   Action on failure:   Terminate cluster

   Job                      State    Start time (UTC-8)   Actions
   job_1487635761459_0001   FAILED   2017-02-20 16:12     View tasks



Figure G.8. First Method for Finding the Task-Level Error Message (1) 


[Screenshot: Jobs > Tasks view for step S-2AJ3UHTU8CTNU, job 1487635761459_0001. Annotation 2: "All the attempts for failed tasks need to be reviewed".]

Task summary: 35684 total tasks - 35681 completed, 0 running, 3 failed, 0 pending, 0 cancelled.

   Task       Type     State       Start time (UTC-8)    Actions
   r_000600   REDUCE   FAILED      2017-02-20 22:42:08   View attempts
   r_000214   REDUCE   FAILED      2017-02-20 21:45:45   View attempts
   r_000005   REDUCE   FAILED      2017-02-20 16:35:38   View attempts
   r_000783   REDUCE   COMPLETED   2017-02-20 23:09:36   View attempts
   r_000782   REDUCE   COMPLETED   2017-02-20 23:09:36   View attempts



Figure G.9. First Method for Finding the Task-Level Error Message (2) 
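
As a cross-check on the task list, the summary line in Figure G.9 can be parsed and the failed tasks isolated. This is a minimal sketch under stated assumptions: the function names and the tuple layout of a task row are hypothetical, and the summary format is taken only from the single line shown in the figure.

```python
import re

# Pattern for the console's task summary line, as it reads in Figure G.9.
SUMMARY_RE = re.compile(
    r"(?P<total>\d+) total tasks - (?P<completed>\d+) completed, "
    r"(?P<running>\d+) running, (?P<failed>\d+) failed, "
    r"(?P<pending>\d+) pending, (?P<cancelled>\d+) cancelled"
)


def parse_task_summary(line: str) -> dict:
    """Return the six counters of a task summary line as integers."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a task summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


def failed_tasks(rows):
    """Keep only FAILED rows, earliest start time first.

    Each row is a hypothetical (task, type, state, start_time) tuple whose
    start_time is a 'YYYY-MM-DD HH:MM:SS' string, so plain string ordering
    is also chronological ordering.
    """
    return sorted((r for r in rows if r[2] == "FAILED"), key=lambda r: r[3])
```

Sorting by start time puts the earliest failure first, which is often the most informative one to review because later failures may be consequences of it.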










[Screenshot: Jobs > Tasks > Task attempts view for step S-2AJ3UHTU8CTNU, job 1487635761459_0001, task r_000600.]

   Attempt   Type     State    Log files
   2         REDUCE   FAILED   controller | syslog | stderr | stdout
   1         REDUCE   FAILED   controller | syslog | stderr | stdout
   0         REDUCE   FAILED   controller | syslog | stderr | stdout
   3         REDUCE   KILLED   controller | syslog | stderr | stdout


Figure G.10. First Method for Finding the Task-Level Error Message (3) 
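
Each row in Figure G.10 is one attempt of task r_000600. Hadoop names an attempt by combining the job ID, the task ID, and the attempt number, as in the identifier attempt_1487586726490_0001_m_007658_0 that appears in Figure G.12. The sketch below splits such an identifier into its parts and rebuilds one from the values shown in the console. The class and function names are hypothetical, and the pattern covers only the identifier form seen in this appendix.

```python
import re
from typing import NamedTuple


class Attempt(NamedTuple):
    job_id: str    # e.g. job_1487635761459_0001
    task_id: str   # e.g. r_000600
    kind: str      # "MAP" or "REDUCE"
    number: int    # attempt number, starting at 0


ATTEMPT_RE = re.compile(r"^attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)$")


def parse_attempt_id(attempt_id: str) -> Attempt:
    """Split an attempt ID into job ID, task ID, task type, and attempt number."""
    match = ATTEMPT_RE.match(attempt_id)
    if match is None:
        raise ValueError(f"unrecognized attempt ID: {attempt_id!r}")
    timestamp, job_seq, kind, task_no, attempt_no = match.groups()
    return Attempt(
        job_id=f"job_{timestamp}_{job_seq}",
        task_id=f"{kind}_{task_no}",
        kind="MAP" if kind == "m" else "REDUCE",
        number=int(attempt_no),
    )


def build_attempt_id(job_id: str, task_id: str, number: int) -> str:
    """Compose an attempt ID from a job ID, a task ID, and an attempt number."""
    if not job_id.startswith("job_"):
        raise ValueError(f"unrecognized job ID: {job_id!r}")
    return f"attempt_{job_id[len('job_'):]}_{task_id}_{number}"
```

With these helpers, the four rows of Figure G.10 correspond to attempt numbers 0 through 3 of task r_000600 in job_1487635761459_0001, which makes it easier to search the log files for a specific attempt.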

2. Second Method 

The second method continues from the last step of the Second Method for finding 
the cluster-level error message, as shown in Figure G.11. 


[Screenshot: Steps panel of the failed cluster (2 steps, all loaded). Annotation 1: "Click to view all jobs", pointing at the "View all jobs" link above the step list. The panel lists step S-2AJ3UHTU8CTNU (1arc, Failed, 7 hours, 28 minutes) and step S-PDRE92M5Z0KB (Setup Hadoop Debugging, Completed, 2 seconds).]

Figure G.11. Second Method for Finding the Task-Level Error Message 


The rest of the Second Method is the same as the First Method. 


G.3 Container-Level Log 

It is rarely necessary to look in the container-level log for an error. However, when either 
a cluster-level or a task-level error message names a container ID (as shown in 
Figure G.12), the container log should be consulted to trace the cause of the error. 


Container [pid=5928,containerID=container_1487586726490_0001_01_041423] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.8 GB of 5 GB virtual memory used. Killing container.

[Annotation in the figure: the container ID is the indicator.]

Dump of the process-tree for container_1487586726490_0001_01_041423 :

|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

|- 5928 5922 5928 5928 (bash) 0 0 115806208 690 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx864m -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1487586726490_0001/container_1487586726490_0001_01_041423/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1487586726490_0001/container_1487586726490_0001_01_041423 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.31.17.213 38133 attempt_1487586726490_0001_m_007658_0 41423 1>/var/log/hadoop-yarn/containers/application_1487586726490_0001/container_1487586726490_0001_01_041423/stdout 2>/var/log/hadoop-yarn/containers/application_1487586726490_0001/container_1487586726490_0001_01_041423/stderr

|- 5933 5928 5928 5928 (java) 31643 2059 2846990336 261697 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx864m -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1487586726490_0001/container_1487586726490_0001_01_041423/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1487586726490_0001/container_1487586726490_0001_01_041423 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.31.17.213 38133 attempt_1487586726490_0001_m_007658_0 41423

Figure G.12. Information about the Container-Level Log 

The log files are all stored in S3. We must navigate to the folder of the specified container, as 
shown in Figure G.13. 
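
Before opening that folder, the key facts can be pulled out of a message like the one in Figure G.12. The sketch below is illustrative only: the function names are hypothetical, the patterns assume the exact wording and column order shown in the figure, and the 4096-byte page size used to convert RSSMEM_USAGE(PAGES) into bytes is an assumption about the cluster's operating system.

```python
import re

# First line of the message: container ID plus physical memory usage and limit.
HEADER_RE = re.compile(
    r"Container \[pid=(?P<pid>\d+),containerID=(?P<container_id>container_[\d_]+)\] "
    r"is running beyond physical memory limits\. Current usage: "
    r"(?P<used>[\d.]+) (?P<used_unit>[MG]B) of "
    r"(?P<limit>[\d.]+) (?P<limit_unit>[MG]B) physical memory used"
)

# One process row of the process-tree dump; group 9 is RSSMEM_USAGE(PAGES).
# The column-title row does not match because its fields are not numeric.
PROCESS_RE = re.compile(
    r"^\s*\|- (\d+) (\d+) (\d+) (\d+) \((\S+)\) (\d+) (\d+) (\d+) (\d+) ",
    re.MULTILINE,
)


def parse_memory_error(message: str) -> dict:
    """Extract the container ID and the reported physical memory figures."""
    match = HEADER_RE.search(message)
    if match is None:
        raise ValueError("no 'running beyond physical memory limits' header found")
    return match.groupdict()


def total_rss_bytes(message: str, page_size: int = 4096) -> int:
    """Sum the resident set size of every process listed in the dump."""
    pages = sum(int(m.group(9)) for m in PROCESS_RE.finditer(message))
    return pages * page_size
```

For the two processes listed in Figure G.12, 690 + 261697 pages at 4096 bytes each come to 1,074,737,152 bytes. If the 1 GB limit is read as 2^30 = 1,073,741,824 bytes, the container exceeded it by slightly less than 1 MB, which is consistent with the reported "1.0 GB of 1 GB".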


[Screenshot: Amazon S3 console, region US East (N. Virginia). First annotation: "Use the cluster ID as a reference to find the container folder. There should be only one application folder with all container information in it." Second annotation, at the search box: "Enter the container ID here to find the specified container".]

Path: Amazon S3 > aws-logs-270560560283-us-east-1 > elasticmapreduce > j-CDSI5A6E98IJ > containers > application_1487586726490_0001

   Name
   container_1487586726490_0001_01_000001
   container_1487586726490_0001_01_000002
   container_1487586726490_0001_01_000003


Figure G.13. How to Find the Specified Container Folder 
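
The folder in Figure G.13 can also be located directly, because the application folder name is embedded in the container ID. The following is a minimal sketch with hypothetical function names. It assumes the container ID form seen in Figure G.12 (container_<timestamp>_<application number>_<attempt>_<container number>) and the folder layout visible in Figure G.13.

```python
import re

CONTAINER_RE = re.compile(r"^container_(\d+)_(\d+)_(\d+)_(\d+)$")


def application_id(container_id: str) -> str:
    """Derive the application folder name from a container ID."""
    match = CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError(f"unrecognized container ID: {container_id!r}")
    timestamp, app_number = match.group(1), match.group(2)
    return f"application_{timestamp}_{app_number}"


def container_log_prefix(cluster_id: str, container_id: str,
                         base_prefix: str = "elasticmapreduce") -> str:
    """Build the S3 key prefix of one container's log folder.

    Assumption: the layout is
    <base>/<cluster ID>/containers/<application ID>/<container ID>/.
    """
    app = application_id(container_id)
    return f"{base_prefix}/{cluster_id}/containers/{app}/{container_id}/"


# Container named in the error message of Figure G.12.
print(container_log_prefix("j-CDSI5A6E98IJ",
                           "container_1487586726490_0001_01_041423"))
```

Building the prefix this way avoids paging through the container list, which is shown 100 entries at a time in the console.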


Once we reach the folder of the specified container, we can open the container-level log to 
diagnose the failed execution, as shown in Figure G.14. 
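
Once the container's syslog.gz has been downloaded, the lines worth reading first can be filtered out programmatically. This is a minimal sketch: the function name is hypothetical, and the marker strings are assumptions about what an error line typically contains, not an exhaustive list.

```python
import gzip
from typing import List, Sequence

# Assumed markers of interest; adjust them to the error being traced.
DEFAULT_MARKERS = ("ERROR", "FATAL", "Exception", "Killing container")


def suspicious_lines(gz_bytes: bytes,
                     markers: Sequence[str] = DEFAULT_MARKERS) -> List[str]:
    """Decompress a gzipped log and return the lines containing any marker."""
    text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    return [line for line in text.splitlines()
            if any(marker in line for marker in markers)]


# Typical use with a file downloaded from S3:
#     with open("syslog.gz", "rb") as handle:
#         for line in suspicious_lines(handle.read()):
#             print(line)
```

Decoding with errors="replace" keeps the scan from aborting if the log contains bytes that are not valid UTF-8.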


[Screenshot: contents of the container folder in the Amazon S3 console. Annotation: "Open the file to view the container-level log message".]

   Name        Last modified             Size     Storage class
   syslog.gz   Feb 20, 2017 3:41:47 AM   1.3 KB   Standard


Figure G.14. Content of the Desired Container Folder 